[01:25:40] FIRING: VarnishPrometheusExporterDown: Varnish Exporter on instance cp5022:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[01:25:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp5022 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=eqsin&var-instance=cp5022 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[02:20:43] FIRING: HaproxyKafkaExporterDown: HaproxyKafka on cp5022 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=eqsin&var-instance=cp5022 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[09:07:51] FIRING: FermMSS: Unexpected MSS value on 10.2.2.44:443 @ registry1005 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=4&var-site=eqiad&var-cluster=misc - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[09:12:51] FIRING: [2x] FermMSS: Unexpected MSS value on 10.2.1.44:443 @ registry2004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[09:14:33] Traffic, DC-Ops, ops-eqsin: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411 (Vgutierrez) NEW
[09:14:49] Traffic, DC-Ops, ops-eqsin: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11515417 (Vgutierrez) p:Triage→Medium
[09:17:51] FIRING: [3x] FermMSS: Unexpected MSS value on 10.2.1.44:443 @ registry2004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[09:22:51] FIRING: [4x] FermMSS: Unexpected MSS value on 10.2.1.44:443 @ registry2004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[09:27:51] RESOLVED: [4x] FermMSS: Unexpected MSS value on 10.2.1.44:443 @ registry2004 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[09:48:20] Traffic, Liberica: Reduce the chances of false positives on MSS clamping alerts - https://phabricator.wikimedia.org/T400155#11515503 (JMeybohm) We just got this as a red herring during a registry outage where nginx was failing to start (so nothing listening)
[09:55:31] Traffic, SRE: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11515509 (Fabfur) @Xqt we're rolling out a change that should lift the current ratelimiting and impact Pywikibot too, could you please check in ~30 minutes if yo...
[10:28:19] Traffic, SRE: Wiki Education Dashboard being rate-limited for OAuth login and token fetching - https://phabricator.wikimedia.org/T414114#11515627 (Joe) Hi, I still see a lot of requests from your IPs with user-agent `Faraday v2.14.0`. These are calls to `//w/api.php`, `/w/api.php`, `/w/index.php` in...
[10:56:42] Traffic, Liberica, Patch-For-Review: Reduce the chances of false positives on MSS clamping alerts - https://phabricator.wikimedia.org/T400155#11515772 (Vgutierrez) Open→Resolved a:Vgutierrez
[11:35:04] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11516063 (elukey) >>! In T250367#11511124, @ayounsi wrote: >> Is sretest2003 the only one that shows this behavior, or do we have others? I am particularly i...
[14:42:28] Traffic, MediaWiki-Debug-Logger, SRE, MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11516901 (kostajh) > 1) passing the relevant headers through to MediaWiki Who from #SRE co...
[14:43:22] Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11516906 (Vgutierrez)
[14:56:04] netops, Traffic, Infrastructure-Foundations: magru hosts (erronouesly) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473 (ssingh) NEW
[14:56:21] netops, Traffic, Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11516987 (ssingh)
[15:13:42] netops, Traffic, Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11517055 (taavi)
[15:14:46] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11517056 (cmooney) @Jclark-ctr I went to do this but it turns out we need to disconnect all the switch - switch links before the de...
[15:19:18] Traffic, DC-Ops, ops-codfw, SRE, Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11517101 (ssingh) >>! In T392851#11514052, @Jhancock.wm wrote: > @ssingh do you need assistance getting these reimaged? Thanks for the offer, @Jhancock.w...
[15:25:59] Traffic, SRE: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11517177 (ssingh) Yes, thanks for the ping @Paladox. We should most certainly pick this up again. @BBlack: any fresh 2026 thoughts? You listed some concerns above but some of them don't apply anymore -- should we do...
[15:27:49] Traffic, MediaWiki-Debug-Logger, SRE, MediaWiki-Platform-Team (Q3 Kanban Board): Pass through information about the client from the CDN to MediaWiki to Logstash - https://phabricator.wikimedia.org/T412396#11517188 (ssingh) >>! In T412396#11516901, @kostajh wrote: >> 1) passing the relevant heade...
[15:29:55] Traffic, SRE: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11517195 (ssingh) I should mention that `ns[01]` v6 will be unicast, like v4, and `ns2` will be anycast v6, just like the v4 one. But these are minor operational details, the real question is if we are ready to do th...
[15:46:21] Traffic, SRE: All github action tests of Pywikibot fails due to 429 status code (TOO MANY REQUESTS) - https://phabricator.wikimedia.org/T414173#11517274 (Xqt) @Fabfur: I can’t reproduce this issue locally, but it still occurs in the Pywikibot tests, though less frequently, see https://github.com/wikimedi...
[15:52:16] Traffic, SRE: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11517321 (BBlack) We have to take this plunge someday, and that someday probably should've been years ago, just too many other pressing things to focus on for anyone to remember to come back here and look! A few no...
[15:57:47] Traffic, SRE: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11517352 (BBlack) [In fact, on that point, I'd note a quick survey of a handful of other major sites on the Internet shows a common pattern of 2 days for the NS records and 2-4 days on the matching address records....
[16:03:02] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11517391 (RobH) cp5022 is unresponsive to ping on its primary interface (expected with OS down) and idrac/mgmt interface (unexpected). 1-255962774671 entered should be completed by 2026-01-15 @ 13:...
[16:04:21] k vgutierrez / fabfur: patch updated with ins- prefix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1218817
[16:20:43] netops, Infrastructure-Foundations, Data-Engineering (Q3 FY25/26 January 1st - March 31th), Epic: SDS 1.3.8 Review network constraints of 100% sampled instruments - https://phabricator.wikimedia.org/T414487 (Milimetric) NEW
[16:21:44] Traffic, Prod-Kubernetes, Kubernetes, Patch-For-Review, ServiceOps new: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#11517592 (MLechvien-WMF) a:Vgutierrez→JMeybohm
[16:22:07] jayme: \o/
[16:22:13] eheh
[16:22:25] don't open the bottles just yet :p
[16:23:25] I might come after you with super stupid questions and all that
[16:25:40] Traffic, Essential-Work, MW-1.46-notes (1.46.0-wmf.5; 2025-12-02), Patch-For-Review, Test Kitchen (Test Kitchen (Experiment Platform Sprint 18)): Test the impact of incremental increase in traffic for cache splitting experiments - https://phabricator.wikimedia.org/T407570#11517631 (KReid-WMF)
[16:31:30] @fabfur and @vgutierrez: dates of our 10% sampled instrument on enwiki were 2025-10-18 to 2025-10-31
[16:34:45] Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11517731 (Vgutierrez)
[16:52:32] cdanis: +1 from us on the Gerrit rollout, in magru and then global. let us know how the experience is, feedback to vg for the stellar liberica experience :>
[16:54:48] <3
[16:54:55] thank you! I'll wait until monitoring works again :>
[16:55:14] unless you get a nuclear mushroom in Sao Paolo
[16:55:19] then I'm not here
[16:59:39] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11517940 (elukey) Hey folks, to restart the conversation SRE would like to build a bridge between HDFS and the puppetservers. T...
[17:06:57] Traffic, DC-Ops, ops-eqsin, SRE: cp5022 is unreachable - https://phabricator.wikimedia.org/T414411#11517997 (RobH) > IBX Question:Dear Customer,We have traced both power cables, they are both connected at port 30 of PS1 and PS2.We have also unplugged and plug back both power cables as instructed....
[17:12:12] milimetric: 2025-10-18 or 2025-10-28?
[17:19:44] vgutierrez: I triple checked, pretty sure it's 2025-10-18 (it was really 10-17 but we didn't get much traffic that first day) to 2025-10-31, why are you seeing something different?
[17:22:13] milimetric: https://grafana.wikimedia.org/goto/entdq34Dg?orgId=1 made me wonder
[17:24:01] I see, interesting - but we turned it off 10-31 and traffic went down (not to zero but a lot) after that, so that spike being sustained makes me think it's something else
[17:26:53] zooming out to 90 days tells me that yes
[17:34:37] netops, Infrastructure-Foundations, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11518247 (JAllemandou)
[18:28:41] netops, Infrastructure-Foundations, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11518529 (cmooney) Huh yeah this is quite odd alright. Taking dse-k8s-worker1011 and dse-k8s-worker1013 as two example hosts to...
[18:33:17] Traffic, SRE: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11518537 (ssingh) >>! In T81605#11517321, @BBlack wrote: > We have to take this plunge someday, and that someday probably should've been years ago, just too many other pressing things to focus on for anyone to rememb...
[18:36:48] Traffic, SRE: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11518553 (ssingh) Our glue records also have a disparity. ` dig wikimedia.org NS +trace +additional ns2.wikimedia.org. 3600 IN A 198.35.27.27 ns1.wikimedia.org. 3600 IN A 208.80.153.231 ns0.wikimedia.org. 3600 IN A...
[19:22:43] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11518749 (JAllemandou) >>! In T402512#11517940, @elukey wrote: > Hey folks, to restart the conversation SRE would like to build...
[19:45:35] netops, Infrastructure-Foundations, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11518808 (CDanis) I took a quick look at the state of sockets on dse-k8s-worker1010, since FIN_WAIT_1 is //not// supposed to stic...
[19:52:09] netops, Infrastructure-Foundations, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11518838 (cmooney) Thanks @cdanis, yeah in terms of the TCP state machine I wasn't quite sure how the apparent packet loss transl...
[19:57:53] Traffic, MW-Interfaces-Team, Epic, FY2025-26 KR 5.1, and 3 others: rest gateway: implement cost-based rate limits - https://phabricator.wikimedia.org/T412586#11518865 (Scott_French)
[20:05:48] Traffic, MW-Interfaces-Team, Epic, FY2025-26 KR 5.1, and 3 others: rest gateway: implement cost-based rate limits - https://phabricator.wikimedia.org/T412586#11518900 (Scott_French) @Clement_Goubert @daniel - If you could provide more detail on sizing, timing, and priority at your convenience, th...
[20:07:54] Traffic, MW-Interfaces-Team, Epic, FY2025-26 KR 5.1, and 3 others: Epic: Enforce API rate limits (WE5.1.3c) - https://phabricator.wikimedia.org/T412585#11518902 (Scott_French)
[21:22:35] btw, I got sidetracked on the gerrit CDN rollout today, but I will push buttons tomorrow