[08:48:20] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11527862 (brouberol) I can take care of spinning up the airflow instance if required @BTullis.
[08:55:03] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11527881 (elukey) @brouberol yeah let's do it if you have time!
[09:03:36] netops, Traffic, Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11527890 (ayounsi) With a long running MTR from alert1002 to 195.200.68.98 (doh7003), I was able to capture this routing change. `name=standard path HO...
[09:04:06] Traffic, DNS, serviceops, SRE, and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527892 (Dzahn) Yea, it is. Languages would typically be added to `dns/templates/helpers/langlist.tmpl` but it feels like adding a non-language to the "...
[09:07:21] Traffic, DNS, serviceops, SRE, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527902 (Dzahn) I would think it belongs into the section for `Wikis with mobile site (alphabetic order), which are not covered by langlist.tmpl`. htt...
[09:08:22] Traffic, DNS, serviceops, SRE, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11527904 (Dzahn) Note that there is also a section for `Wikis without mobile site (alphabetic order), which are not covered by langlist.tmpl` right belo...
[09:29:47] netops, Traffic, Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11527945 (cmooney) @ssingh my apologies I even deliberately tried searching for this task and somehow didn't find it the other day, thanks for filing....
[09:59:07] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11528045 (BTullis) >>! In T402512#11527862, @brouberol wrote: > I can take care of spinning up the airflow instance if required...
[10:01:20] Traffic, Data-Engineering, Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11528049 (brouberol) Sure thing. I'd need a couple of details from you @elukey, namely the default team name DAGs would be labe...
[10:09:08] netops, Traffic, Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11528095 (cmooney) For now I have removed the temp static route config on cr1-eqiad. Let's see how things go.
[10:13:55] Traffic, Data-Engineering, Infrastructure-Foundations, Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11528106 (elukey) @brouberol I'd say team name "sre" and the root wikimedia email as starter, then later...
[10:28:14] Traffic, MW-Interfaces-Team, ServiceOps new, Epic, and 3 others: Epic: Enforce API rate limits (WE5.1.3c) - https://phabricator.wikimedia.org/T412585#11528143 (daniel)
[10:28:19] Traffic, MW-Interfaces-Team, serviceops, Epic, and 3 others: Epic: API Rate Limiting Architecture - https://phabricator.wikimedia.org/T399291#11528144 (daniel)
[10:29:40] Traffic, MW-Interfaces-Team, ServiceOps new, Epic, and 3 others: Epic: Enforce API rate limits (WE5.1.3c) - https://phabricator.wikimedia.org/T412585#11528147 (daniel)
[10:29:43] Traffic, MW-Interfaces-Team, serviceops, Epic, and 3 others: Epic: API Rate Limiting Architecture - https://phabricator.wikimedia.org/T399291#11528148 (daniel)
[10:31:48] Traffic, Data-Engineering, Infrastructure-Foundations, Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11528160 (MoritzMuehlenhoff) For logging into the instance we can use cn=ops,ou=groups,dc=wikimedia,dc=org
[11:30:05] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: eqiad: rows C/D Upgrade Decom Asw Switches in Rows C & D - https://phabricator.wikimedia.org/T412525#11528293 (Jclark-ctr) @cmooney i have disconnected all the switches
[13:01:20] netops, Infrastructure-Foundations, SRE, Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11528619 (BTullis) I have also made the following ticket regarding upgrading the 1 Gbps network connections: {T414787}
[13:39:11] Traffic, ServiceOps new, ServiceOps-Services-Oids, WE4.2 Bot detection: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable - https://phabricator.wikimedia.org/T411191#11528729 (Raine) p:Low→Lowest
[13:40:31] Traffic, ServiceOps new, ServiceOps-Services-Oids, WE4.2 Bot detection: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable - https://phabricator.wikimedia.org/T411191#11528734 (Raine) p:Lowest→Low
[13:54:49] mutante: FIRING: [14x] SystemdUnitFailed: prometheus-node-textfile-check-nft.service on tcp-proxy1001:9100
[13:55:09] mutante: that nft cleanup needs some love
[14:00:28] vgutierrez: sigh.. I already deleted that yesterday and did not expect it to come back.. ok
[14:01:14] like why is the unit re-created by puppet. :(
[14:03:06] puppet.log and puppet.log.1 don't show it as being recreated
[14:03:51] but the puppetization doesn't support the removal of the check
[14:04:07] it just disappeared from the catalog when you switched to ferm
[14:04:58] so the service and the timer need to be removed manually
[14:05:46] that is what I did. deleted unit file and systemctl reset-failed via cumin and it recovered
[14:06:42] but you need to reload systemd as well
[14:07:02] as in systemctl daemon-reload
[14:07:20] otherwise the unit stays active
[14:07:24] https://www.irccloud.com/pastebin/tV9l7ylr/
[14:07:40] ok, I will do that. I just mean I 100% saw the recovery
[14:08:39] ack, on it. crappy wifi
[14:08:48] and you could wipe check-nft as well: -r-xr-xr-x 1 root root 1.4K Oct 27 20:30 /usr/local/bin/check-nft
[14:12:50] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only - https://phabricator.wikimedia.org/T414805 (MatthewVernon) NEW
[14:19:09] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: Propose a new set of standard thumbnail sizes - https://phabricator.wikimedia.org/T412971#11529006 (MatthewVernon) Open→Resolved I think we're settled on this set of sizes.
[14:20:54] Traffic: Consider rate limiting non-standard thumbnail sizes - https://phabricator.wikimedia.org/T402792#11529010 (MatthewVernon)
[14:22:58] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11529015 (MatthewVernon)
[14:23:07] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11529019 (MatthewVernon) Open→Resolved
[14:23:39] systemctl daemon-reload done. check-nft deleted. double checked nothing called "*nft*" in /lib/systemd/system/ at all. systemctl reset-failed (barely online :p)
[14:25:00] ack
[14:38:28] it resolved. found a public wifi
[14:38:53] cool
[14:42:07] Traffic, DNS, serviceops, SRE, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11529094 (ssingh) Yes, I should have clarified better, sorry. There is nothing special about `langlist.tmpl` as such. It just lists the language editions...
[14:47:29] Traffic, DNS, serviceops, SRE, and 3 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11529117 (ssingh) ` {% from "helpers/langlist.tmpl" import langs %} {% for lang in langs -%} {{ lang }} 1D IN CNAME dyna.wikimedia.org. {{ lang }}.m 1D...
[14:47:37] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529119 (cmooney) >>! In T81605#11522551, @ssingh wrote: > @cmooney: Any picks for your favourite v6 address for `ns1`? I was thinking of allocating `2620:0:860:ed1a::4/128` under LVS service I...
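Editor's note on the check-nft cleanup discussed between 14:03 and 14:23: the manual steps mentioned in the channel (remove the service and timer, reload systemd, reset the failed state, delete the check script) can be sketched as a small dry-run helper. This is only an illustration under stated assumptions, not the procedure that was actually run: the timer name and the unit file directory are guesses based on the service name in the alert and the /lib/systemd/system/ path mentioned above, and `cleanup_commands` and `run` are hypothetical helper names.

```python
import shlex
import subprocess

UNIT = "prometheus-node-textfile-check-nft"  # service name taken from the alert
UNIT_DIR = "/lib/systemd/system"             # assumption: where the unit files live
CHECK_SCRIPT = "/usr/local/bin/check-nft"    # path quoted in the channel


def cleanup_commands(unit=UNIT, unit_dir=UNIT_DIR, script=CHECK_SCRIPT):
    """Return the cleanup steps, in order, as argv lists."""
    return [
        # Stop the timer and service first (the timer name is an assumption).
        ["systemctl", "stop", f"{unit}.timer", f"{unit}.service"],
        # Remove the unit files that Puppet no longer manages.
        ["rm", "-f", f"{unit_dir}/{unit}.service", f"{unit_dir}/{unit}.timer"],
        # Make systemd forget the removed unit files.
        ["systemctl", "daemon-reload"],
        # Clear the failed state so the SystemdUnitFailed alert can recover.
        ["systemctl", "reset-failed", f"{unit}.service"],
        # Remove the leftover check script.
        ["rm", "-f", script],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=False: a unit that is already gone should not abort the rest.
            subprocess.run(argv, check=False)


if __name__ == "__main__":
    run(cleanup_commands())  # dry run: prints the commands without executing them
```

The dry-run default is deliberate: the printed commands can be reviewed, or pasted into whatever remote execution tool is in use, before anything is changed on a host.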
[14:49:36] netops, Infrastructure-Foundations: asw1-b12-drmrs stopped reporting metrics - https://phabricator.wikimedia.org/T413181#11529124 (ayounsi) JTAC asked us to try to reboot various daemons, none of them worked. Now they asked for a full switch reboot. I followed up saying I'd rather troubleshoot the issue p...
[14:49:43] FIRING: [7x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[14:54:23] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529143 (cmooney) >>! In T81605#11518553, @ssingh wrote: > Our glue records also have a disparity. I was interested to know what effect this would have. One data-point for Bind (at least my l...
[14:54:43] FIRING: [8x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[15:04:43] RESOLVED: [8x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[15:07:46] netops, Traffic, Infrastructure-Foundations: magru hosts (erroneously) reported down due to TTL exceeded - https://phabricator.wikimedia.org/T414473#11529188 (ssingh) Thanks for looking into this, folks! And also for submitting the other patch for splitting the magru traffic.
[15:52:45] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529426 (ssingh) >>! In T81605#11529119, @cmooney wrote: >>>! In T81605#11522551, @ssingh wrote: >> @cmooney: Any picks for your favourite v6 address for `ns1`? I was thinking of allocating `26...
[16:11:20] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529540 (cmooney) >>! In T81605#11529426, @ssingh wrote: > Thanks! The plan is to do the same for `eqiad` Ok I've reserved those two ranges/IPs in Netbox now. > Any thoughts on the last one (an...
[16:15:01] Traffic, MediaWiki-Platform-Team, MW-Interfaces-Team, ServiceOps new, and 3 others: Epic: Enforce API rate limits (WE5.1.3c) - https://phabricator.wikimedia.org/T412585#11529556 (Clement_Goubert) p:Triage→High
[16:16:19] Traffic, MediaWiki-Platform-Team, MW-Interfaces-Team, ServiceOps new, and 3 others: Epic: Enforce API rate limits (WE5.1.3c) - https://phabricator.wikimedia.org/T412585#11529569 (Clement_Goubert) p:High→Triage Sorry for the priority noise, I misclicked while triaging.
[16:20:06] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11529590 (ssingh) >>! In T81605#11529540, @cmooney wrote: >>>! In T81605#11529426, @ssingh wrote: >> Thanks! The plan is to do the same for `eqiad` > > Ok I've reserved those two ranges/IPs in Ne...
[16:29:44] Traffic, Data-Engineering, Infrastructure-Foundations, Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11529638 (elukey) Next steps: * DP to create the Airflow SRE instance. * Me and DP to configure the rsy...
[17:07:10] https://letsencrypt.org/2026/01/15/6day-and-ip-general-availability
[17:07:51] 90-day and IP address certs available
[17:09:09] 90 days is the standard :)
[17:09:10] neat
[17:09:21] glad I have cert-manager set up properly at home finally :>
[17:12:34] sorry I meant the new 45 one
[17:13:08] hmm 6 days
[17:13:30] will we ever do those?
[17:13:42] 45 days will be available for early adopters on May 13, 2026
[17:13:47] per https://letsencrypt.org/2025/12/02/from-90-to-45
[17:18:31] yeah. last time we discussed this, we decided we will try 6 days but will likely default to 45?
[17:59:36] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530021 (ayounsi) overall lgtm Using a full /64 unicast `2620:0:860:53::/64` for a single service looks a bit weird, but as it's something critical like AuthDNS it doesn't shock me too much. T...
[18:12:00] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530059 (ssingh) >>! In T81605#11530021, @ayounsi wrote: > overall lgtm > > Using a full /64 unicast `2620:0:860:53::/64` for a single service looks a bit weird, but as it's something critical...
[18:53:20] Traffic, Data-Persistence, MediaViewer, SRE-swift-storage, Thumbor: FY 25/26 WE 5.4.7 Standardize thumbnail sizes - https://phabricator.wikimedia.org/T408062#11530186 (simon04) I wonder whether any documentation needs to be updated, for instance... - https://www.mediawiki.org/wiki/Help:Im...
[19:02:26] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530209 (BBlack) >>! In T81605#11529143, @cmooney wrote: >>>! In T81605#11518553, @ssingh wrote: >> Our glue records also have a disparity. > > I was interested to know what effect this would...
[20:21:48] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530397 (cmooney) >>! In T81605#11530209, @BBlack wrote: > Except almost nobody but engineers are going to directly query that record. Most caches will learn and re-learn it as they traverse t...
[21:12:34] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530539 (jeremyb) >>! In T81605#11530397, @cmooney wrote: > As a further test I wiped my cache, started a packet capture and did a dig for '//en.wikimedia.org//'. did you intend to use a non-c...
[21:41:56] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530621 (BBlack) >>! In T81605#11530397, @cmooney wrote: > But it seems Bind does not cache the glue records / additional that comes back from the .org authdns. At least for any length of time...
[23:39:50] Traffic, SRE, Patch-For-Review: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605#11530844 (cmooney) >>! In T81605#11530539, @jeremyb wrote: > did you intend to use a non-canonical domain here? pedia vs media. Ah sorry that was a typo, corrected now. I looked up //en.wikipe...