[07:41:33] 10netops, 06Traffic, 06Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11519843 (10ayounsi) As it's not a timeout, but a TTL issue, that might match some transport link "event" causing this brief alert. VMs are now 1 extra ro...
[08:01:29] 10netops, 06Infrastructure-Foundations, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11519863 (10JAllemandou) >>! In T414460#11518808, @CDanis wrote: > > The spike a few days after the start of the month is interest...
[08:52:21] 10netops, 06Infrastructure-Foundations, 10netbox: Automatically run Capirca Netbox script regularly - https://phabricator.wikimedia.org/T361549#11520056 (10ayounsi) Thanks to the latest patches, it's now possible to see if there are pending changes to be committed to the Capirca file. Just run the script wit...
[09:05:48] Hi! I'm working on making the REST gateway return a Retry-After header when requests get rate limited. As an aside, I was going to add a default Retry-After of 60 seconds for 503 and 504 responses, if the backend doesn't specify a value. Does that sound good? Or would it cause problems?
[09:05:48] My thinking was that it would be nice to consistently return Retry-After with all 429, 503, and 504 responses from all APIs.
[09:11:14] I think a Retry-After of 60s for 503 and 504 is ok; if the backend doesn't specify it correctly, it's already better than the existing situation!
[09:51:20] 06Traffic, 06MW-Interfaces-Team, 07Epic, 05FY2025-26 KR 5.1, and 3 others: rest gateway: implement cost-based rate limits - https://phabricator.wikimedia.org/T412586#11520339 (10Clement_Goubert) >>! In T412586#11518896, @Scott_French wrote: > @Clement_Goubert @daniel - If you could provide more detail on s...
[11:05:48] fabfur: thanks!
[11:41:49] duesen: afaik Retry-After isn't used for 504s
[11:43:27] see https://www.rfc-editor.org/rfc/rfc9110#status.503 vs https://www.rfc-editor.org/rfc/rfc9110#status.504
[12:06:26] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520728 (10cmooney) @VRiley-WMF I'll ping you on irc but we want to go ahead and replace the DAC on //d...
[12:08:04] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520733 (10cmooney) Hmm so I was going to see if there was any difference if I did a trace to the ceph...
[12:19:09] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520751 (10cmooney) Also @VRiley-WMF it seems this is actually a 1G RJ45 link. So let's swap the coppe...
[12:33:06] vgutierrez: I can remove it, but it does seem useful to me. A 504 typically self-corrects after some time, but clients shouldn't just hammer us until it does... I find it curious that retry-after isn't specified for 504, since retrying is the only way to resolve a 504 situation...
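[Editor's note] For context on vgutierrez's 11:41 point: RFC 9110 only calls out Retry-After for 503 (and 3xx) responses, and RFC 6585 allows it on 429; the 504 section is silent on it. On the wire, the delta-seconds form of the proposed 60-second default would look like this (hypothetical responses for illustration, not actual gateway output):

```
HTTP/1.1 429 Too Many Requests
Retry-After: 60

HTTP/1.1 503 Service Unavailable
Retry-After: 60
```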
[12:36:38] I couldn't find any useful discussions on that, only hints that retry-after is intended for cases where it is possible to predict when a retry would be successful, and that clients should use incremental back-off for unpredictable transient errors...
[12:37:08] Is that your thinking as well?
[12:55:32] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11520980 (10cmooney) Hmm so with the node un-cordoned the loss has not returned either, well one drop at the first hop but it seems insigni...
[13:18:10] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521085 (10cmooney) >>! In T414460#11518808, @CDanis wrote: > FIN_WAIT_1 is //not// supposed to stick around for longer than a minute or t...
[13:29:32] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11521110 (10Vgutierrez)
[14:09:55] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521279 (10cmooney) >>! In T414460#11521085, @cmooney wrote: > however surely it should try to resend the FIN, and if this state persists...
[14:26:21] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11521320 (10Vgutierrez)
[14:32:29] 06Traffic, 10MediaWiki-Debug-Logger, 06SRE, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521361 (10Vgutierrez) the headers described on https://wikitech.wikimedia.org/wiki/CDN/Backe...
[14:33:09] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, and 2 others: Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521367 (10CDanis) >>! In T414460#11521085, @cmooney wrote: > The k8s host sent a FIN to the remote side but due to the packet-loss issue...
[15:16:54] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11521596 (10Vgutierrez)
[15:34:50] 06Traffic, 07Essential-Work, 05MW-1.46-notes (1.46.0-wmf.5; 2025-12-02), 13Patch-For-Review, 06Test Kitchen (Test Kitchen (Experiment Platform Sprint 18)): Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11521665 (10Sfaci) @ss...
[15:59:01] 06Traffic, 10MediaWiki-Debug-Logger, 06SRE, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521815 (10Tgr) a:03Tgr
[15:59:54] duesen: clients shouldn't hammer us, but if clients don't expect a Retry-After header in a 504 they won't use it and you'd be bloating the response for no reason
[16:00:25] 06Traffic, 10MediaWiki-Debug-Logger, 06SRE, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521825 (10Tgr) We should also update some of the dashboards (at least the login one) with so...
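[Editor's note] Picking up duesen's 12:36 point: the usual client-side pattern for unpredictable transient errors is exponential back-off, with Retry-After taking precedence when the server supplies one. A minimal sketch (hypothetical endpoint; not a recommendation for any particular WMF client):

```sh
#!/bin/sh
# Retry transient errors with exponential back-off, honouring Retry-After
# (delta-seconds form) when the server sends it.
url="https://api.example.org/endpoint"   # hypothetical endpoint
delay=1
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /dev/null -D /tmp/headers -w '%{http_code}' "$url")
  case "$status" in
    429|503|504) ;;   # transient: fall through to the retry logic below
    *) break ;;       # success or non-retryable error: stop
  esac
  # Prefer the server's Retry-After over our own computed delay.
  ra=$(awk 'tolower($1) == "retry-after:" { print $2 }' /tmp/headers | tr -d '\r')
  sleep "${ra:-$delay}"
  delay=$((delay * 2))   # 1s, 2s, 4s, 8s, ...
done
```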
[16:00:57] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11521827 (10cmooney) The SFP module in port 14 of lsw1-c5-eqiad has been swapped out now. So we can observe over the next...
[16:02:56] 06Traffic, 10MediaWiki-Debug-Logger, 06SRE, 06MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11521866 (10Tgr) >>! In T412396#11521361, @Vgutierrez wrote: > the headers described on https:...
[16:04:32] 06Traffic, 06MW-Interfaces-Team, 07Epic, 05FY2025-26 KR 5.1, and 3 others: rest gateway: implement cost-based rate limits - https://phabricator.wikimedia.org/T412586#11521883 (10Scott_French) p:05Triage→03Low
[16:05:20] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11521900 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=cf1deaa2-45c3-45e8-bdad-1303b0075f87) set by pt1979@cumin2002 for 2:00:00 on...
[16:18:28] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11522035 (10Vgutierrez)
[16:35:54] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522109 (10cmooney) Ok currently seeing no loss (though that was the case when we were cordoned before the swap). ` cmoon...
[16:49:13] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522186 (10ops-monitoring-bot) Host dse-k8s-worker1013.eqiad.wmnet rebooted by brouberol@cumin1003 with reason: Getting a...
[16:50:10] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522190 (10VRiley-WMF) Happy to help with this. Let us know if there is anything else we can help with.
[17:31:05] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522388 (10akosiaris)
[17:36:13] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11522428 (10cmooney) Thanks @VRiley. Happy to say we aren't seeing any loss as of yet after the node was uncordoned: ` cm...
[18:17:54] 10netops, 06Traffic, 06Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11522518 (10ssingh) >>! In T414473#11519843, @ayounsi wrote: > As it's not a timeout, but a TTL issue, that might match some transport link "event" causin...
[18:28:03] 06Traffic, 06SRE, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11522551 (10ssingh) @cmooney: Any picks for your favourite v6 address for `ns1`? I was thinking of allocating `2620:0:860:ed1a::4/128` under LVS service IPs `2620:0:860:ed1a::/64`, since unfortuna...
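[Editor's note] Once an address like the `2620:0:860:ed1a::4` proposed in the 18:28 task update is allocated and serving, verifying AuthDNS over v6 is a one-liner; a sketch using the address from the task (not yet in service at the time of this log):

```sh
# Query ns1 directly over IPv6 (proposed address from T81605; hypothetical until allocated).
dig +short AAAA en.wikipedia.org @2620:0:860:ed1a::4
```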
[19:02:02] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, and 2 others: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11522682 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8cc58471-31d6-4e79-ae14-124cd9a6b684) set by pt1979@cumin2002 for 1:00:00 on...
[19:20:18] 06Traffic, 06SRE, 13Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11522747 (10taavi) 05Stalled→03Open
[19:22:39] 10netops, 06Traffic, 06Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11522769 (10ssingh) This time on physical hosts: ` 14:20:36 <+icinga-wm> PROBLEM - Host cp7016 is DOWN: CRITICAL - Time to live exceeded (10.140.1.11) 14...
[19:54:25] could I get a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1226932 from sukhe or someone? :D
[19:59:41] cdanis: looking
[20:05:13] thanks! is that kind of change -- adding addl realserver IPs to pooled-for-other-services nodes -- spooky to roll out?
[20:05:53] (at a minimum I was thinking disable puppet on A:cp-text and then roll forward one node in magru by hand)
[20:07:29] cdanis: I think we have done a few for other nodes, but not the cp ones. in that regard, the cp ones tend to be more scary anyway but I think it should be OK?
[20:07:44] yeah, disabling puppet and a quick test on one host should tell us if things are not right
[20:07:49] coolcool
[20:08:02] and yeah if that fails or obviously messes up, I'll just roll back the first patch
[20:08:04] thanks
[20:08:31] I guess we could do one more thing but I haven't thought it through
[20:08:37] 07HTTPS, 06Traffic, 06SRE: implement Public Key Pinning (HPKP) for Wikimedia domains - https://phabricator.wikimedia.org/T92002#11522895 (10Izno)
[20:08:38] we could simply add the override for magru itself for cp-text
[20:08:52] though I don't think realserver::pools does a merge so that probably won't work on the hash hmm
[20:08:55] ok, never mind
[20:09:53] what I was saying was that if "profile::lvs::realserver::pools" had merge => hash set in the lookup, we could set an override just for magru and add gerrit-https there
[20:10:00] that would then give us the existing ones + gerrit-https, just for magru
[20:10:03] but it's fine
[20:10:45] yeah you can override merge settings globally I think
[20:10:53] if you do it from the appropriate level of hiera resolution
[20:13:54] yep. but that is if we really want to restrict this but since we are doing magru + global, let's just go ahead
[20:15:39] v.g. might disagree so if you want to wait for him, that's also fine (I trust him more than I trust myself anyway)
[20:17:07] 🤠
[20:21:37] swfrench-wmf: ChrisDobbins901_: about to mess with just one cp host in magru, and then potentially all of them, not expecting trouble just fyi
[20:21:55] * swfrench-wmf thumbs up
[20:28:40] cool, that was totally hitless afaict
[20:30:16] https://puppetboard.wikimedia.org/report/cp7001.magru.wmnet/74b629f35c05395b0fcd8bf21e5727e4ca0b4891
[20:30:20] looks good I think
[20:30:37] yeah I was watching a bunch of stuff live
[20:30:51] the s/text-https/gerrit-https thing is throwing me off but we will learn to live with it
[20:31:32] yeah ... I think that was unavoidable with this approach unfortunately
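[Editor's note] For the record, the hash-merge idea from 20:09 would look roughly like this in hiera; a sketch with hypothetical file placement, since the real hierarchy and the full key contents aren't shown in this log:

```yaml
# hieradata/common.yaml (hypothetical location): make the key hash-merge
# across hierarchy levels instead of taking only the highest-priority value.
lookup_options:
  profile::lvs::realserver::pools:
    merge: hash

# hieradata/magru.yaml (hypothetical site-level override): adds gerrit-https
# on top of the pools defined at lower-priority levels, for magru only.
profile::lvs::realserver::pools:
  gerrit-https:
    services:
      - gerrit
```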
[20:31:38] FIRING: [2x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.225:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:31:43] hmm
[20:33:13] that might just be a race condition?
[20:33:20] that's probably because of the switch but looks good otherwise
[20:35:03] cdanis: I guess if we really really want to be sure, maybe you can try it on one more host (the one you are hitting in eqiad) before rolling out
[20:36:17] stepping out for school pickup but will be back in 10ish.
[20:37:18] I'm pretty confident, aside from the MSS issue
[20:41:10] try it on one more host and see if it happens. in some ways, that might just be from the new gerrit lb connection being established
[20:43:09] if not then yeah we can rollback and do it in the morning when vg is around. but I think that was a false positive, much like we see during a reboot
[20:43:12] > If for some reason the kernel is unable to answer to the initial SYN packet or it answers with an RST packet, this alert will trigger a false positive.
[20:44:14] yeah
[20:44:22] I continued with all of cp-text in magru
[20:44:53] going to enable-puppet on cp-text globally soon
[20:45:15] and then going to mess with LVS in magru
[20:46:28] enabling puppet now, going to just let it roll because it did look hitless
[20:46:38] FIRING: [12x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.225:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:46:49] ah
[20:48:36] what's going on?
[20:48:46] vgutierrez: I'm rolling back a patch :3
[20:48:51] that's no good
[20:48:58] vgutierrez: it's only for the new gerrit VIP
[20:49:18] ohh
[20:49:31] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1226932
[20:50:57] Loaded: loaded (/lib/systemd/system/tcp-mss-clamper.service; enabled; vendor preset: enabled)
[20:50:58] Active: active (running) since Tue 2025-10-14 13:40:57 UTC; 3 months 0 days ago
[20:51:35] I guess we're missing a refresh dependency in puppet somewhere
[20:51:38] FIRING: [16x] LVSRealserverMSS: Unexpected MSS value on 195.200.68.225:443 @ cp7001 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=magru&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:51:42] yeah... tcp mss clamper needs to be manually restarted
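[Editor's note] The systemd status paste at 20:50 is the tell: the unit has been running since October, so it cannot know about a VIP added today. A quick way to spot this staleness (a sketch; 195.200.68.225 is the new VIP from the alerts above, and the authoritative check is the argv paste at 20:51:43 below):

```sh
# Is the new VIP in the running clamper's argument list?
ps -o args= -C tcp-mss-clamper | grep -q '195\.200\.68\.225' \
  && echo 'clamper already covers the new VIP' \
  || echo 'stale argv: tcp-mss-clamper needs a restart to clamp the new VIP'
```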
[20:51:43] └─1968 /usr/bin/tcp-mss-clamper --ipv4-mss 1440 --ipv6-mss 1400 -p :2200 -s 195.200.68.224:443,195.200.68.224:80,[2a02:ec80:700:ed1a::1]:443,[2a02:ec80:700:ed1a::1]:80 -i eno12399np0,lo
[20:51:45] before
[20:51:51] and that requires a depool
[20:51:58] that's by design
[20:52:11] oh
[20:52:39] well that's a big TIL
[20:52:43] same
[20:52:51] we should document it if not already
[20:52:56] or the limitation, given adding vips to realservers doesn't happen too often
[20:53:00] been a while since we did this
[20:53:06] yeah I guess
[20:53:33] why does it require a depool
[20:53:54] because uh I had already done it to all magru just before you said that
[20:54:08] cdanis: let me know if you need an extra pair of hands. I will be online soon
[20:54:11] thanks <3
[20:54:53] because restarting tcp mss clamper will stop mss clamping for a few milliseconds
[20:54:59] oh
[20:55:07] and connections accepted during that time won't be clamped
[20:55:28] are you sure that's actually less impactful than a depool?
[20:55:35] er, more
[20:56:38] FIRING: [24x] LVSRealserverMSS: Unexpected MSS value on 185.15.58.225:443 @ cp6009 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[20:57:25] re-disabled puppet on A:cp-text
[20:57:30] hmmm not really
[20:57:40] it should be really fast
[20:58:24] I did it percussively on magru
[20:58:34] and there's no obvious impact 😅
[20:59:09] if katran drops connections we know why :)
[20:59:33] I would never blame you for my own yeehaw
[21:00:23] ok I'm going to continue then
[21:00:25] thanks vg <3
[21:01:38] FIRING: [40x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5023 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:03:01] np
[21:06:38] FIRING: [46x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5020 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:09:34] cdanis: here now. can I help? sorry :D
[21:10:23] sukhe: seems like restarting tcp-mss-clamper doesn't strictly require a depool
[21:10:55] Jan 14 21:10:12 lvs7003 libericad[1528]: time=2026-01-14T21:10:12.566Z level=INFO msg="control plane is now aware of the current status of all realservers" service=gerrit-httpslb6_443
[21:10:57] Jan 14 21:10:12 lvs7003 libericad[1528]: time=2026-01-14T21:10:12.566Z level=INFO msg="new healthcheck result received" service=gerrit-httpslb_443 hostname=cp7002.magru.wmnet address=10.140.1.4 healthcheck_name=HTTPCheck healthcheck_id=2911427075 healthcheck_result=true
[21:10:59] Jan 14 21:10:12 lvs7003 libericad[1528]: time=2026-01-14T21:10:12.566Z level=INFO msg="control plane is now aware of the current status of all realservers" service=gerrit-httpslb_443
[21:11:01] very cool :D
[21:11:38] FIRING: [78x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5020 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:12:09] those should clear soon
[21:12:39] sukhe: I think I will do https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215398 and then stop today?
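[Editor's note] Given the "by design" constraint above, the careful version of the rollout is the standard drain-restart-repool dance on each realserver; a sketch (depool/pool being the usual conftool wrapper scripts on WMF realservers), as opposed to the "percussive" in-place restart that turned out to be harmless here:

```sh
# Per-realserver: drain traffic so the brief unclamped window can't catch
# freshly accepted connections, then restart and repool.
depool
sudo systemctl restart tcp-mss-clamper   # picks up the new -s VIP list
pool
```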
[21:13:22] if you are ready :D
[21:16:38] FIRING: [78x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:19:36] ok, I'll stop for now, we can do drmrs/the world tomorrow
[21:19:40] thanks for all the help!
[21:20:01] haha, you did most of it. sorry about not knowing about the restart part
[21:20:12] all good on the reloads in magru?
[21:20:17] yep!
[21:20:19] nice
[21:20:21] on lvs7003 and 7001
[21:20:25] yep sounds good
[21:20:26] that part was very easy
[21:20:58] yep, it's very nice
[21:21:38] RESOLVED: [84x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5017 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:23:14] no pending alerts, we are all clear
[21:23:33] and -- it works!
[21:24:08] well hm. port 443 works
[21:24:20] that is all that should work though right
[21:24:31] I thought the tcp-proxies were ready to accept traffic
[21:24:51] I'll check, but not a concern rn
[21:25:40] I vaguely recall that something was left but let me check the task
[21:25:41] yeah
[21:26:21] nope, https://phabricator.wikimedia.org/T408064
[21:26:36] the magru proxies certainly should be up
[21:27:26] there was an issue with their provisioning, which is what I had in mind, but it seems like per the above, everything was done
[21:27:34] `liberica cp services` shows empty pools for the gerrit 29418 services
[21:27:56] so I'm missing something on the puppet side
[21:30:05] indeed
[21:30:26] class: high-traffic1
[21:30:26] conftool:
[21:30:26] cluster: tcp-proxy
[21:30:26] service: gerrit
[21:31:01] I thought service there was supposed to match
[21:31:07] profile::lvs::realserver::pools:
[21:31:09] gerrit-ssh:
[21:31:11] services:
[21:31:13] - gerrit
[21:32:05] and also uh
[21:32:08] conftool-data/node/magru.yaml
[21:32:09] 26: tcp-proxy:
[21:32:11] 27: tcp-proxy7001.magru.wmnet: [gerrit]
[21:32:13] 28: tcp-proxy7002.magru.wmnet: [gerrit]
[21:32:19] er also
[21:32:23] these are in service_setup?
[21:32:28] lvs_setup ?
[21:32:56] yeah sorry, updated
[21:33:01] so that matches at least
[21:33:41] this is in the "look at it from scratch" stage now
[21:33:51] basically start from https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service and see what we are missing
[21:34:05] [except the instructions are for eqiad/codfw, but besides that]
[21:35:23] lol https://config-master.wikimedia.org/pybal/magru/gerrit-ssh
[21:35:35] hahaha
[21:36:49] sukhe@puppetserver1001:~$ sudo confctl select 'cluster=tcp-proxy' get
[21:36:57] might as well enable it all
[21:37:08] FIRING: [32x] LVSRealserverMSS: Unexpected MSS value on 103.102.166.225:443 @ cp5024 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[21:37:15] gerrit-sshlb_29418:
[21:37:17] 10.140.2.10 1 healthy: false | pooled: depool-blocked
[21:37:21] hey what really
[21:38:33] 2a02:ec80:700:103:10:140:2:10 1 healthy: false | pooled: depool-blocked
[21:38:46] probably meaning that the depool threshold is keeping it alive?
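[Editor's note] For readers replaying this: `depool-blocked` means the control plane wants to depool a backend that is failing healthchecks, but the configured depool threshold won't let it (confirmed by the libericad WARN at 21:44 below). Inspecting state, and pooling a backend once it is actually healthy, is done with confctl; a sketch using tags matching the conftool-data above (the set/pooled action is illustrative, use with care):

```sh
# Show the gerrit tcp-proxy objects and their state:
sudo confctl select 'cluster=tcp-proxy,service=gerrit' get
# Pool one specific backend (hypothetical action; only once it's healthy):
sudo confctl select 'name=tcp-proxy7001.magru.wmnet,service=gerrit' set/pooled=yes
```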
[21:38:52] oh I meant the MSS alerts
[21:38:54] but yeah
[21:39:11] the MSS alert might just be delayed (sigh) since I can't find anything in the alertmanager
[21:39:13] they look to be for gerrit-lb again
[21:41:17] I am not sure why it says not healthy though
[21:41:18] sukhe@lvs7001:~$ nc tcp-proxy7001.magru.wmnet 29418
[21:41:18] SSH-2.0-GerritCodeReview_3.10.6 (APACHE-SSHD-2.12.0)
[21:41:31] let's pool both in magru?
[21:41:42] the v6 is unhappy
[21:41:54] ahh
[21:41:56] hm
[21:42:36] LISTEN 0 1024 *:29418 *:* users:(("haproxy",pid=696,fd=9))
[21:42:57] do we have to do something silly like have it listen on both 0.0.0.0 and [::]
[21:43:11] except that sukhe@lvs7001:~$ nc -6 tcp-proxy7001.magru.wmnet 29418
[21:43:13] looks happy
[21:43:22] and that's kinda all of the healthcheck anyway?
[21:43:34] maybe
[21:43:36] there is no proxyfetch here (doesn't need to be)
[21:44:10] Jan 14 21:41:52 lvs7001 libericad[1436749]: time=2026-01-14T21:41:52.947Z level=WARN msg="unable to depool due to depool threshold enforcement" service=gerrit-sshlb6_29418 hostname=tcp-proxy7002.magru.wmnet address=2a02:ec80:700:103:10:14>
[21:44:16] so that confirms why it says depool-blocked
[21:44:19] not why it can't reach it though
[21:45:19] this is where I vaguely recall something wrong with the magru hosts related to v6
[21:46:08] but yeah let's look at this tomorrow now, other than the alert, nothing really is breaking
[21:46:28] isn't v4 ssh unhealthy too?
[21:46:43] gerrit-sshlb_29418:
[21:46:45] 10.140.2.11 1 healthy: false | pooled: depool-blocked
[21:46:46] yep, it is. I only remember a v6 issue though, during provisioning
[21:46:47] 10.140.2.10 1 healthy: false | pooled: depool-blocked
[21:46:49] yeah
[21:47:06] we haven't validated that haproxy can receive ip-tunnelled ssh yet :)
[21:47:10] tcp-proxy haproxy, that is
[21:47:23] anyway, stopping for now
[21:47:25] thanks again!
[21:47:26] yep :)
[21:48:18] we can change the LVS state to service_setup for the gerrit related alerts if we want to silence them
[21:48:21] I will check later
[21:50:17] hmm are the instances ready to receive IPIP traffic?
[21:50:34] I can check that tomorrow morning
[21:52:11] https://gitlab.wikimedia.org/-/snippets/107 should do it
[21:53:48] nice. please go offline no
[21:53:49] w
[21:58:10] if ipip0 and ipip60 are there
[21:58:15] vgutierrez: buying some beer for you in Lisbon :D
[21:58:20] you're probably missing the fw rules
[21:58:43] ahhh
[21:58:50] to allow inbound traffic from the IP space we use for IPIP
[21:59:09] check ncredir puppetization
[21:59:21] * vgutierrez going to sleep now
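[Editor's note] Per vgutierrez's closing hints, IPIP readiness on the tcp-proxy hosts comes down to two things: the tunnel interfaces exist, and the firewall admits encapsulated traffic from the load balancers. A rough sketch (the interface names come from the chat; the nft grep is an assumption, since the actual rules live in the ncredir puppetization):

```sh
# 1. Tunnel interfaces for v4 and v6 encapsulation present?
ip link show ipip0 && ip link show ipip60
# 2. Firewall permitting encapsulated traffic in? A heuristic only (assumes an
#    nftables ruleset; protocol 4 is ipencap, 41 is IPv6 encapsulation):
sudo nft list ruleset | grep -Ei 'ipencap|l4proto (4|41)'
```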