[00:14:45] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11523618 (10Papaul) Phase 1 of ULSFO migration which was changing the loopback addresses of cr1,cr4 ,mr1 and the IP address of the link between cr3 and cr4 was... [01:38:23] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.97:443 @ cp4039 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS [02:47:40] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [02:52:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:07:41] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:12:41] RESOLVED: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:16:10] FIRING: [5x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:17:56] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:21:10] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:22:56] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:27:55] FIRING: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:37:55] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:40:12] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11523794 (10Papaul) [03:56:10] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [03:57:55] RESOLVED: [7x] VarnishHighThreadCount: Varnish's thread count on cp5017:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [04:28:45] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11523834 (10Papaul) [05:35:10] 10netops, 06DC-Ops, 
06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11523872 (10Papaul) [05:38:23] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.97:443 @ cp4039 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS [07:15:21] hey traffic, is someone around who could help with a purge maybe? [07:22:55] kinda early but sure [07:23:58] vgutierrez: I just ran this one: sudo cumin -b 1 A:cp-text "varnishadm -n frontend ban 'req.http.host == \"wikipedia25.org\"'" [07:24:13] I think it already fixed part of the issue [07:24:32] I need another one for this URL: https://www.wikipedia25.org/en [07:24:44] that is a 404 but should not be [07:24:57] the first one was a cached "domain not configured" but works now [07:25:20] <_joe_> https://www.wikipedia25.org/en works for me [07:25:36] <_joe_> mutante: but you can ban individual urls with the mwscript command [07:26:18] <_joe_> ah no wait [07:26:21] _joe_: thank you. hmm.. where can I find that [07:26:29] <_joe_> it works if you go through https://www.wikipedia25.or [07:26:29] I could repeat the above just with www. [07:26:41] <_joe_> I think they do something like document.location in js [07:26:50] <_joe_> so I think the problem is actually on the backend [07:27:20] ooh! thanks, I am forwarding this to the dev [07:27:30] <_joe_> mutante: where is the microsite repo? [07:28:20] <_joe_> mutante: https://wikitech.wikimedia.org/wiki/Kafka_HTTP_purging [07:28:38] <_joe_> I'll try just to make sure, but that 404 comes from apache I think [07:28:40] https://gitlab.wikimedia.org/toolforge-repos/wikipedia25-years-of-wikipedia [07:29:29] yeah... purging on varnish isn't enough [07:29:42] <_joe_> what's the docker image, mutante ? [07:29:49] <_joe_> sorry it's quicker than looking up myself [07:30:01] https://docker-registry.wikimedia.org/repos/sre/miscweb/wikipedia25-years-of-wikipedia/tags/ [07:30:07] <_joe_> I think maybe adding a RewriteRule is enough [07:30:07] 2026-01-14-150341 [07:30:28] tried so hard to avoid a k8s deploy at this moment :/ [07:30:37] trying kafka purge [07:31:28] <_joe_> no kafka won't work [07:31:31] <_joe_> trying a fix [07:31:42] <_joe_> I reproduced locally [07:31:43] ack, thanks [07:31:47] talking in slack as well [07:33:36] <_joe_> ok I have a solution, let me try the easier one [07:33:55] hmm the 404 for https://www.wikipedia25.org/en comes from varnish BTW [07:34:05] <_joe_> no it comes from the image [07:34:10] should I just repeat the purge I did above.. but for the www.
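For reference, repeating that ban for the www host is a small variation on the command above. A minimal sketch, assuming the same A:cp-text Cumin alias and frontend Varnish instance; the second form, narrowing the ban to a single path, is a hypothetical illustration of banning one URL rather than a whole host:

    # ban everything cached under the www host (mirrors the earlier command)
    sudo cumin -b 1 A:cp-text "varnishadm -n frontend ban 'req.http.host == \"www.wikipedia25.org\"'"
    # hypothetical narrower variant: ban a single path on that host
    sudo cumin -b 1 A:cp-text "varnishadm -n frontend ban 'req.http.host == \"www.wikipedia25.org\" && req.url == \"/en\"'"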
[07:34:16] <_joe_> I reprod locally [07:34:23] ok [07:34:23] <_joe_> give me 5 minutes I'll have a patch [07:34:31] :) [07:36:57] yes, "from the / > /en, is a js rewrite of the url/history without real redirect and actually fetching /en (edited) " [07:39:51] <_joe_> so one solution is to copy the index.html that is at the root of the container to en/index.html [07:40:35] ACK, forwarding this option [07:40:51] <_joe_> the other is doing something with apache rewriterules [07:40:56] <_joe_> which I'm trying now [07:42:19] Artem says option 1 would not work and we need the rewrite rule [07:42:29] he is the developer of the site [07:42:47] <_joe_> mutante: tell artem I just verified it works and to shut up [07:43:59] "There are too many urls to cover, 7 languages, 16 urls for each language" [07:44:31] wonders how the tests went [07:46:15] <_joe_> ok so how can I do the rewriterule? [07:46:24] <_joe_> I need at least the full list of languages [07:46:45] <_joe_> so the rewrite rule has the same issue [07:46:49] <_joe_> also what rules [07:47:59] the languages: ar en es fr ja ms pt [07:48:22] I am inviting him to join [07:48:31] there is artem [07:48:36] hello [07:48:39] so.. I just listed the 7 languages [07:49:32] artemkloko how should the rules look? [07:50:19] <_joe_> ok give me the list of languages, I can prepare the patch [07:51:00] _joe_: ar en es fr ja ms pt [07:51:26] <_joe_> mutante: where is the service-vhost config in the repo? [07:51:35] <_joe_> or is it in the base image? [07:51:36] <_joe_> sigh [07:52:27] production/service-vhost.conf: ServerName wikipedia25.org [07:52:47] ServerName wikipedia25.org [07:52:47] ServerAlias www.wikipedia25.org [07:52:53] <_joe_> ah ok [07:55:15] <_joe_> btw the problem is there for any of the urls [07:55:23] <_joe_> I can only fix the base one [07:55:38] <_joe_> if someone reloads the page, it's going to get a 404 [07:56:08] All traffic from "not found" resources should be directed to index.html [07:56:08]         RewriteEngine On [07:56:09]         RewriteBase / [07:56:09]         RewriteRule ^index\.html$ - [L] [07:56:10]         RewriteCond %{REQUEST_FILENAME} !-f [07:56:10]         RewriteCond %{REQUEST_FILENAME} !-d [07:56:11]         RewriteRule . /index.html [L] [07:56:29] should I push a change to the main branch of the repo? [07:57:57] <_joe_> that's a bit too wide, I was limiting it to just /{en} and co [07:58:03] <_joe_> but it should also work [07:58:09] <_joe_> yes please push the change [07:58:33] <_joe_> artemkloko: have you tested locally? [07:59:16] will update the image version in deployment-charts once this goes through [07:59:22] <_joe_> RewriteBase / and RewriteRule ^index\.html$ - [L] are not needed [07:59:30] <_joe_> but they do no harm [08:01:12] > have you tested locally? [08:01:12] doing now [08:01:31] <_joe_> I *think* it should work, else there's a solution using mod_alias [08:03:15] <_joe_> I'm not sure your change works fwiw [08:07:11] almost done testing, give me a sec [08:07:56] <_joe_> what you propose doesn't work. [08:08:59] <_joe_> I'm preparing a proper fix [08:09:10] why? [08:09:29] it works on my local server [08:10:04] <_joe_> I'm testing in the docker image [08:10:06] but I am ok with a proper solution ofc [08:10:16] <_joe_> how are you testing? [08:11:54] artemkloko: just to double check, have you tried refreshing the page locally?
[08:11:55] <_joe_> artemkloko: RewriteBase isn't an apache 2.4 directive IIRC [08:12:47] DOCKER_BUILDKIT=1 docker build --tag wikipedia25-years-of-wikipedia-blubber --target production -f .pipeline/blubber.yaml . [08:12:48] docker run --rm -d -p 8080:8080 wikipedia25-years-of-wikipedia-blubber [08:13:17] <_joe_> and RewriteBase / works? [08:14:26] looks like it's in apache 2.4 https://httpd.apache.org/docs/current/mod/mod_rewrite.html#rewritebase [08:14:33] on my local network and server [08:14:33] http://thinkstation.local:8080/ > works [08:14:34] http://thinkstation.local:8080/en > works [08:14:34] refreshing http://thinkstation.local:8080/en > works [08:15:40] deeper urls like http://thinkstation.local:8080/ja/the-world-is-changing also work [08:15:43] <_joe_> does /fr work? on my test it hangs [08:16:20] a user reported: "When I click on any of the 'Transcript' buttons (at the top of each of the audio snippet windows), I get taken to a 'Not Found' page." [08:16:26] <_joe_> ah found the problem [08:16:28] yes both /fr and /fr/ [08:16:49] <_joe_> RewriteRule ^index\.html$ - [L] is where the syntax error is, and it's superfluous [08:17:07] > Transcript [08:17:07] yes because it should also be handled by the js router after the traffic is directed to /index.html [08:17:13] ack [08:17:51] <_joe_> artemkloko: what I get is [08:17:55] <_joe_> AH00526: Syntax error on line 21 of /srv/app/service-vhost.conf: [08:17:55] <_joe_> RewriteBase: only valid in per-directory config files [08:18:07] <_joe_> you have added it inside the Directory? [08:18:27] yes [08:18:45] <_joe_> yeah, maybe not [08:18:50] <_joe_> ok lemme send my version [08:18:56] can you please propose a proper fix for the syntax error? [08:19:32] <_joe_> artemkloko: just to make sure, you confirm the fix is to send any non-existent url to index.html? [08:20:59] yes [08:22:39] <_joe_> artemkloko: still trying to make sure everything works with the rewrites in the subdir as you did, hang on a sec [08:23:50] <_joe_> so one problem I have is [08:24:01] <_joe_> if I switch languages in the UI, it works [08:24:11] <_joe_> if I try to reload afterwards, it hangs [08:24:14] <_joe_> no idea why tbh [08:24:23] <_joe_> for any language that's not english [08:24:41] <_joe_> but looks like a vue issue rather than an apache conf one [08:25:36] <_joe_> lemme try a last thing with rewrites [08:26:05] can you check the network tab in the browser inspector? does it load a bunch of js and media files, or hang on something like the initial url loading? [08:27:04] <_joe_> it loads assets/index-DoazOAh7.js [08:27:23] <_joe_> my browser is set to english fwiw [08:28:05] <_joe_> ah yes sorry [08:28:09] <_joe_> http://localhost:8080/pt/which-wikipedia-of-the-future-are-you this works [08:28:13] <_joe_> reload I mean [08:29:07] <_joe_> http://localhost:8080/pt/ works [08:29:10] <_joe_> http://localhost:8080/pt doesn't [08:29:27] <_joe_> it's a bug in the website not the rewrites I think [08:29:34] <_joe_> because /en works [08:30:04] <_joe_> ok you can submit your version, there's no real difference between the results I get [08:31:27] <_joe_> artemkloko: ^^ please go ahead, that bug will need fixing separately [08:31:34] <_joe_> or I can submit my patch [08:31:59] I have pushed a change to the repo already, let me check the build pipeline [08:32:05] <_joe_> ok [08:32:21] <_joe_> mutante: will you take care of the k8s deployment/patch I guess? [08:32:27] _joe_: yes, I will
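Pulling the thread above together: a minimal sketch of the fallback that _joe_ and artemkloko converged on. The <VirtualHost> skeleton is an assumption based on the ServerName/ServerAlias lines quoted earlier, /srv/app is a guess based on the path in the AH00526 error message, and RewriteBase plus the ^index\.html$ rule are dropped as superfluous per the discussion:

    <VirtualHost *:8080>
        ServerName wikipedia25.org
        ServerAlias www.wikipedia25.org
        DocumentRoot "/srv/app"
        <Directory "/srv/app">
            # anything that is not an existing file or directory falls through to the SPA entry point
            RewriteEngine On
            RewriteCond %{REQUEST_FILENAME} !-f
            RewriteCond %{REQUEST_FILENAME} !-d
            RewriteRule . /index.html [L]
        </Directory>
    </VirtualHost>

A related alternative on Apache 2.4 is the single mod_dir directive FallbackResource /index.html, which achieves the same effect without mod_rewrite.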
[08:32:47] waits for a new image on https://docker-registry.wikimedia.org/repos/sre/miscweb/wikipedia25-years-of-wikipedia/tags/ [08:36:21] <_joe_> don't wait for that, you can get the name from the pipeline [08:36:41] <_joe_> mutante: https://gitlab.wikimedia.org/repos/sre/miscweb/wikipedia25-years-of-wikipedia/-/jobs/719999 [08:36:55] <_joe_> docker-registry.wikimedia.org/repos/sre/miscweb/wikipedia25-years-of-wikipedia:2026-01-15-080024 [08:37:04] I see 2026-01-15-080024 in the gitlab job [08:37:11] ok! ack. updating! [08:37:24] yes, from pipeline [08:38:40] pushing https://gerrit.wikimedia.org/r/1227260 .. on it [08:42:27] ok, I have the diff in helmfile now.. deploying to k8s [08:44:49] artemkloko: deployed 2026-01-15-080024 ! [08:45:39] ty!!! it works for me [08:45:50] ❤️ [08:45:55] :) [08:46:48] 🎉 [08:47:14] Still fails for me on a mobile browser [08:47:30] Both Safari and Firefox Focus [08:47:35] Fresh sessions [08:47:58] Could someone else verify on mobile? [08:48:19] The reload fails, first load is fine [08:48:24] For clarity [08:49:07] for me it's like English works but not Spanish or French [08:49:08] Now it loaded after a few more reloads [08:49:15] ah [08:49:32] yes, same. now works [08:49:42] And looks like it's fine in all languages [08:49:45] works in english here [08:50:19] fr confirmed, / redirect properly works [08:50:36] Transcripts redirect properly as well [08:50:54] _joe_: thank you so much [08:51:27] yes, glad you were around [08:52:21] Thank you everyone for the effort and great work! [08:52:58] artemkloko: are you in contact with stakeholders about announcing the status? [08:53:09] can we mention it in public places [08:53:14] yes, we are doing a mini qa first [08:53:30] ok! great [08:53:31] then we will announce the successful deployment [08:53:37] alright [08:53:46] glad it wasn't already announced [08:54:03] that would have been a rough start [08:54:22] alright, you should go to that QA now. I will keep an eye on this and slack just in case [08:57:53] XioNoX, topranks I'm getting old [08:58:01] I'm debugging an issue with IPIP traffic on tcp-proxy7001 [08:58:16] traffic reaches the instance from the lvs or a test host [08:58:26] pwru is telling me that there is some kind of netfilter drop [08:58:30] 0xffff93d3b5046200 0 :0 4026531840 0 ens13:2 0x0800 65536 80 172.16.174.72:0->10.140.2.10:0() sk_skb_reason_drop(SKB_DROP_REASON_NETFILTER_DROP) [08:59:14] old?? but you're using all the hot cool toys like the kids do! [08:59:34] so the SYN packet arrives as expected via the IPIP tunnel [08:59:39] and it gets accepted by netfilter [08:59:54] 51231 4098K ACCEPT 4 -- * * 172.16.0.0/12 0.0.0.0/0 [09:00:07] 172.16.0.0/12 is the range we use as source for IPIP traffic [09:00:28] what is puzzling me is that I don't see any DROP counter increasing [09:00:45] and drops aren't being logged either [09:01:24] and -t raw doesn't have anything interesting either [09:02:28] this is my favorite channel today. love to see it [09:02:41] it's also happening for IPv6 traffic FWIW [09:02:52] and same thing for encapsulated traffic [09:02:54] 51486 6178K ACCEPT 41 -- * * 100::/64 ::/0 [09:03:33] I'm tempted to blame `src_sets => ['PRODUCTION_NETWORKS'],` on ferm config for proxy-gerrit-ssh [09:03:48] but I'd expect to see an iptables DROP log increasing [09:05:01] rp_filter looks as expected FWIW [09:05:43] hmmm almost [09:06:16] net.ipv4.conf.default.rp_filter = 2 [09:09:52] nah... I manually set those to 0 and same thing
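Roughly, the debugging loop being described looks like the sketch below. Assumptions: pwru takes a pcap-filter expression as its argument, protocol 4 is IPIP and 41 is IPv6 encapsulation as in the counters quoted above, and ens13 is the interface name taken from the trace line:

    # trace encapsulated packets through the kernel and print drop reasons
    sudo pwru 'ip proto 4'       # IPv4-in-IPv4 (what produced the NETFILTER_DROP above)
    sudo pwru 'ip6 proto 41'     # the IPv6-in-IPv6 equivalent
    # check reverse-path filtering on the interfaces involved
    sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.default.rp_filter net.ipv4.conf.ens13.rp_filter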
[09:11:49] yeah loose mode (2) should be ok in this circumstance, on a box with a default route it basically does the same as 0 [09:12:28] vgutierrez: do you need to enable forwarding on the box? [09:12:34] yeah I'm not seeing it, there are no DROPs in any of the rules in the input chain filter [09:13:15] I'd have thought forwarding is not needed but that's good thinking, worth a shot to enable it [09:13:16] XioNoX: nope [09:13:27] at least I got it disabled on ncredir boxes and it's working as expected [09:15:08] can you share the full pwru trace? tbh I'm unlikely to make sense of it but just in case [09:16:41] https://www.irccloud.com/pastebin/yj2wIEdD/ [09:20:50] and exactly the same pwru output when filtering proto 41 instead of proto 4 [09:21:49] no the full trace of course... but hitting the same netfilter_drop [09:25:41] nothing in the iptables logs since december.. /var/log/ulogd/syslog.log [09:28:52] 10netops, 06Infrastructure-Foundations, 06SRE: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11524188 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi All hosts that are not pending decom have been migrated to single uplink, resolving. [09:38:23] FIRING: [8x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.97:443 @ cp4039 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS [09:44:39] hmmm [09:44:46] how do nft rules interact with iptables? [09:45:28] cause we also have nft rules in that box [09:45:48] and I'm not seeing anything there that allows IPIP traffic [09:47:24] is it normal to have both? [09:47:33] I thought it was one or the other [09:48:20] me too tbh [09:48:24] moritzm: ^^ [09:49:31] no, it's not normal to have both. usually it's one or the other [09:49:52] so tcp-proxy puppetization has some kind of issue? [09:50:06] if you switch the firewall::provider thing in Hiera to nft it normally removes iptables [09:50:34] oh. we had to switch that [09:50:50] looking [09:50:59] realserver::ipip requires iptables so far AFAIK [09:51:43] this was an unusual case, because we started with nftables and then "downgraded" to ferm [09:51:56] for that reason [09:52:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1215284 [09:52:28] so it's supposed to be only ferm/iptables [09:53:37] stopping nftables should be enough? [09:53:40] let me see if I can clean it up on a random one.. like tcp-proxy1001 [09:54:20] well, removing the package [09:54:34] rules are still in the kernel [09:54:41] removing the package won't fix it [09:54:51] unless there is a hook that takes care of them [09:54:52] and reboots [09:55:03] or that :) [09:55:12] wanna take a break? I can take this [09:55:17] and let you know when done [09:56:10] puppet does the cleanup automatically but I guess only for the expected path from ferm -> nft.. not backwards [09:57:07] adding support for nft should be feasible BTW [09:57:18] but so far we didn't have any realserver using it [09:57:34] I see. ack [09:57:39] but that seems for a later time [09:57:44] sorry I got dragged away. but yes it shouldn't have nft and iptables, something definitely wrong there
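A quick way to confirm a host is running both firewall stacks at once, which is what made this drop invisible in the iptables counters. A sketch, with nothing host-specific assumed:

    # is there a live nftables ruleset in addition to the iptables one?
    sudo nft list ruleset | head -n 20
    sudo iptables -L INPUT -nv | head -n 20
    sudo ip6tables -L INPUT -nv | head -n 20
    # the iptables binary also reports whether it is the legacy or nf_tables variant
    sudo iptables -V

Rules loaded directly via nft live in their own tables and are evaluated independently of the iptables ruleset, so a drop there never shows up in the iptables counters or the ulogd logs, matching what was observed above.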
[09:58:11] there's no tested/supported puppet path for nft -> ferm [09:58:40] we spent a lot of effort to get ferm -> nftables to work fine, but for the reverse there wasn't any real case [09:58:43] removing the nftables package and rebooting all the tcp-proxies [09:58:44] I can't really imagine why we would ever want that [09:59:00] either clean it up manually or reimage them now that the correct setting is in Puppet [09:59:03] when we think we want nftables and then realize we can't use it for this case yet :) [09:59:24] what are you missing with nftables that you need to go back? [09:59:52] "liberica does not support nftables yet " [10:01:02] ok :) [10:02:40] 06Traffic, 06Data-Engineering, 06Infrastructure-Foundations, 13Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11524261 (10elukey) ` elukey@krb1002:~$ cat puppetserver_keytabs puppetserver1001.eqiad.wmnet,create_princ... [10:03:03] purging nftables via cumin [10:06:00] ack [10:08:10] doing reboots [10:10:40] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-ulsfo, 06SRE: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11524278 (10cmooney) >>! In T408892#11523618, @Papaul wrote: > Phase 1 of ULSFO migration which was changing the loopback addresses of cr1,cr4 ,mr1 and the IP... [10:16:07] mutante: that fixed the issue BTW [10:16:29] hmm or not [10:17:51] vgutierrez: all 14 hosts have had the package removed and were rebooted. confirmed via cumin. "nft: command not found" [10:18:00] issue as in healthchecks being dropped [10:18:13] and yes.. [10:18:15] it fixed the issue [10:18:16] Jan 15 10:16:30 lvs7003 libericad[1528]: time=2026-01-15T10:16:30.310Z level=INFO msg="detected healthcheck state change" service=gerrit-sshlb6_29418 hostname=tcp-proxy7002.magru.wmnet address=2a02:ec80:700:103:10:140:2:11 healthcheck_name=IdleTCPConnectionCheck healthcheck_id=1227911398 healthcheck_result_old=false healthcheck_result=true [10:18:20] cool [10:18:36] gerrit-sshlb6_29418: [10:18:36] 2a02:ec80:700:103:10:140:2:10 1 healthy: true | pooled: yes [10:18:37] 2a02:ec80:700:103:10:140:2:11 1 healthy: true | pooled: yes [10:18:53] gerrit-sshlb_29418: [10:18:53] 10.140.2.10 1 healthy: true | pooled: yes [10:18:53] 10.140.2.11 1 healthy: true | pooled: yes [10:19:04] so this actually means we have gerrit behind the CDN now .. mind blown [10:19:13] of course just as "opt-in" right now [10:19:28] well.. just in magru for ssh [10:19:33] roll out needs to be completed [10:19:38] gotcha!
ack [10:20:28] loving it [10:42:08] RESOLVED: [8x] LVSRealserverMSS: Unexpected MSS value on 198.35.26.97:443 @ cp4039 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=ulsfo&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS [11:07:31] 06Traffic, 06SRE Observability, 10ServiceOps new: Proof of Concept: SquareOne CDN Dashboards - https://phabricator.wikimedia.org/T414665 (10jijiki) 03NEW [11:11:17] 06Traffic, 06SRE Observability, 10ServiceOps new: Proof of Concept: SquareOne CDN Dashboards - https://phabricator.wikimedia.org/T414665#11524428 (10jijiki) 05Open→03In progress [11:17:34] 06Traffic, 06Data-Engineering, 06Infrastructure-Foundations, 13Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11524439 (10elukey) While reviewing the last patch with Moritz, he made me realize that we'd need Java 8 on... [11:17:39] 06Traffic, 10Liberica: Support nft enabled realservers using IPIP encapsulation - https://phabricator.wikimedia.org/T414666 (10Vgutierrez) 03NEW [11:17:55] 06Traffic, 10Liberica: Support nft enabled realservers using IPIP encapsulation - https://phabricator.wikimedia.org/T414666#11524450 (10Vgutierrez) p:05Triage→03Medium [11:23:50] mutante: BTW you need to drop the production networks restriction in gerrit-ssh proxy ferm rule [11:26:25] that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/1227294 [11:36:01] 06Traffic: upgrade to HAProxy 2.8.18 - https://phabricator.wikimedia.org/T414318#11524495 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [12:02:46] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11524578 (10cmooney) //dse-k8s-worker1013// seems fairly happy in terms of the original problem since we made the change y... [12:03:07] 06Traffic, 06serviceops, 10ServiceOps-Services-Oids, 10ServiceOps new, 05WE4.2 Bot detection: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable - https://phabricator.wikimedia.org/T411191#11524579 (10jijiki) [12:12:21] 06Traffic, 10ServiceOps-Services-Oids, 10ServiceOps new, 05WE4.2 Bot detection: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable - https://phabricator.wikimedia.org/T411191#11524616 (10jijiki) [12:16:37] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11524635 (10BTullis) >>! In T414460#11521367, @CDanis wrote: >>>! In T414460#11521085, @cmooney wrote: >> The k8s host sen... [13:29:29] 10netops, 10Cloud-VPS, 06Infrastructure-Foundations, 06cloud-services-team (FY2025/2026-Q3-Q4): cloud: edge network suffers downtime if one cloudsw is down - https://phabricator.wikimedia.org/T375259#11524793 (10fgiunchedi) [13:39:50] FYI, we're upgrading Bird on the remaining doh* hosts to 2.18 (the pops on routed ganeti already use that version) [13:43:58] 06Traffic, 06Data-Engineering, 06Infrastructure-Foundations, 13Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11524840 (10elukey) I had a chat with Ben about options for S3, and there is another possible road: > I w... 
[14:04:46] 06Traffic, 06SRE Observability, 10ServiceOps new: Proof of Concept: SquareOne CDN Dashboards - https://phabricator.wikimedia.org/T414665#11524916 (10ABran-WMF) [14:04:47] 06Traffic, 06Data-Engineering, 06Infrastructure-Foundations, 13Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11524915 (10BTullis) >>! In T402512#11139224, @brouberol wrote: > We could do this, however there's a fixed... [14:18:28] 06Traffic, 06Data-Engineering, 06Infrastructure-Foundations, 13Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11524969 (10MoritzMuehlenhoff) >>! In T402512#11524915, @BTullis wrote: >>>! In T402512#11139224, @broubero... [14:20:53] 06Traffic, 06Data-Engineering, 06Infrastructure-Foundations, 13Patch-For-Review: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11524987 (10BTullis) >>! In T402512#11524969, @MoritzMuehlenhoff wrote: >> How about [[https://airflow-oper... [14:22:47] vgutierrez: thanks!! [14:24:02] mutante: would you like to do some of the liberica config deployments? it's very easy [14:25:04] oh.. you already triggered a puppet run [14:25:45] vgutierrez@carrot:~$ nc -w 3 -zv gerrit-lb.magru.wikimedia.org 29418 [14:25:45] gerrit-lb.magru.wikimedia.org [195.200.68.225] 29418 (?) open [14:25:46] \o/ [14:26:26] 10netops, 06Infrastructure-Foundations, 06SRE, 06Data-Platform-SRE (2026.01.05 - 2026.01.23): Socket leaking on some dse-k8s row C & D hosts - https://phabricator.wikimedia.org/T414460#11525006 (10CDanis) >>! In T414460#11524635, @BTullis wrote: > My assumption is that this is more likely related to the ce... [14:27:57] 💙cdanis@wmftop.nucleosynth.space ~/work/gits/puppet 🕤☕ echo $(dig +short gerrit-lb.magru.wikimedia.org) gerrit.wikimedia.org | sudo tee /etc/hosts [14:27:59] [sudo] password for cdanis: [14:28:01] 195.200.68.225 gerrit.wikimedia.org [14:28:03] 💙cdanis@wmftop.nucleosynth.space ~/work/gits/puppet 🕤☕ git pull [14:28:05] :D [14:28:41] 💙cdanis@wmftop.nucleosynth.space ~/work/gits/puppet 🕤☕ git review [14:28:45] To ssh://gerrit.wikimedia.org:29418/operations/puppet.git [14:28:46] ! [14:28:47] * [new reference] HEAD -> refs/for/production%topic=gerrit-lb [14:28:55] moritzm: do we have something like /var/log/ulogd/syslog.log for nft blocked traffic? [14:32:10] cdanis: BTW... given haproxy is using v4v6 we get this on the "access" log: Jan 15 14:28:33 tcp-proxy7001 haproxy[798]: <134>Jan 15 14:28:33 haproxy[798]: ::ffff:$cdanis_ipv4:46658 [15/Jan/2026:14:28:27.180] gerrit_ssh gerrit_ssh/backend_server 1/113/6557 9106 -- 5/5/4/4/0 0/0 [14:32:23] yeah, I didn't think that was a big deal, but we can change it if needed [14:32:49] misleading if we need to drop traffic using iptables but yes [14:33:18] also I don't know how good it is to send that to stdout [14:33:18] eh I was imagining we'd just do silent-drop or whatever else in the haproxy config [14:33:22] lol yeah... [14:33:56] sure.. and the silent-drop needs to match $ipv4 or ::ffff:$ipv4? [14:34:23] vgutierrez: src_sets => ['PRODUCTION_NETWORKS', 'LOAD_BALANCER_HEALTH_CHECKS'], ? [14:34:25] mmmmmh okay :D
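For the record, the silent-drop idea being floated would look something like the sketch below. Hypothetical: the frontend name and bind are modelled on the haproxy log line above, the blocklist file path is invented, and whether a bare IPv4 entry also matches its ::ffff: mapped form is exactly the open question raised here, hence listing both spellings:

    frontend gerrit_ssh
        bind [::]:29418 v4v6
        # with a v4v6 bind, IPv4 clients appear as IPv4-mapped IPv6 (::ffff:a.b.c.d);
        # blocked.ips would list both 192.0.2.1 and ::ffff:192.0.2.1 to be safe
        acl blocked src -f /etc/haproxy/blocked.ips
        tcp-request connection silent-drop if blocked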
[14:34:39] mutante: that doesn't include 0.0.0.0 :) [14:35:10] the kernel and haproxy see the user public IP as the source IP for the connection [14:35:32] yeah, once the packet is de-encapsulated, right [14:35:40] yes [14:35:50] does it really mean dropping the rule entirely [14:35:58] or adding to the src_sets [14:35:59] nope [14:36:03] just the src_set bit [14:36:09] mutante: like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1227294/2/modules/profile/manifests/tcpproxy.pp [14:37:42] makes sense. I see it's already merged :) [14:40:25] sorry vg [14:40:28] too early [14:40:45] ☕ [14:41:01] obeying your prompt [14:58:12] moritzm: all good with the bird rollout I assume? [15:04:44] 06Traffic, 06Data-Engineering, 06Infrastructure-Foundations: Export development_network_probe data to Puppet servers for CDN deployment - https://phabricator.wikimedia.org/T402512#11525272 (10elukey) @BTullis let's proceed, thanks! I can also create them in case, lemme know what works best. [15:10:37] sukhe: yep, wikidough was all uneventful, I'll upgrade hcaptcha-proxy* on Monday [15:10:43] ok great, thanks [15:10:46] <3 [15:15:33] cdanis: mutante: jelto: congrats on the gerrit CDN thing \m/ [15:19:05] gerrit-lb.drmrs.wikimedia.org [185.15.58.225] 29418 (?) open [15:19:08] pretty smooth now :D [15:19:22] vgutierrez: thanks for unblocking everyone [15:19:45] a little bit of 🔨 was needed [15:20:23] yeah many thanks vg, both for the on-the-spot help twice, and, for the smoothness of liberica operations [15:25:07] uhhhhh now I have a dumb question [15:25:12] no libericas in core DCs even for public-facing? [15:25:20] nope [15:25:31] hmm [15:25:38] we didn't want to mix the configs and have both pybal and liberica running [15:25:40] you'll have to restart pybal on eqiad/codfw :) [15:26:09] coolcool [15:26:24] will also have to make sure the config works under pybal [15:26:26] but the good news is that jayme will work with vgutierrez on unblocking IPIP on k8s this q :) [15:26:30] 🎉 [15:26:38] config should work as well with pybal [15:26:44] pybal does IPIP encapsulation nowadays [15:28:43] ah cool, I was wondering if that was true [15:29:19] yeah, we took care of that to make the migration to liberica/katran easier [15:29:39] yeah I sort of vaguely remember these discussions on the k8s ticket now [15:30:34] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1227356 [15:33:09] just the right DC order... chef_kiss.gif [15:33:36] copy & paste for the win [15:34:02] lol you prefer numeric I see [15:34:18] once upon a time I was very used to east to west [15:37:04] I don't know if that was imposed on me by ema or just some old puppet manifests [15:37:21] but yes.. I'm used to numeric order here [15:37:31] the higher numbers are a bit fuzzy for me :D [15:57:44] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1227363 [16:43:52] 06Traffic, 10ServiceOps-Services-Oids, 10ServiceOps new, 05WE4.2 Bot detection: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable - https://phabricator.wikimedia.org/T411191#11525673 (10MLechvien-WMF) @Raine could you triage this task (scheduled / in progress / backlog...
[16:45:56] I think I'm gonna restart eqiad high-traffic1 pybal soonish [17:09:59] Jan 15 17:09:05 lvs1020 pybal[4056125]: [gerritlb_443] INFO: New enabled server cp1114.eqiad.wmnet, weight 1 [17:10:17] Jan 15 17:09:05 lvs1020 pybal[4056125]: [gerritlb6_29418] INFO: New enabled server tcp-proxy1001.eqiad.wmnet, weight 1 [17:11:22] :P [17:11:52] :o [17:12:14] fixed the alerts for "check-nft".. since we remove nftables [17:12:30] BTW.. you've skipped gerritlb_80 on purpose? :) [17:13:22] nothing wrong with that BTW [17:13:38] it's actually a nice experiment so see what happens with a https endpoint that doesn't have port 80 available [17:13:53] well, and, gerrit apache serves 403s on port 80 lol [17:15:24] so this actually seemed better [17:19:41] sure [17:21:15] 💙cdanis@wmftop ~ 🕧☕ DC=eqiad ; curl -I -X GET https://gerrit.wikimedia.org --connect-to ::gerrit-lb.${DC}.wikimedia.org ;nc -vW1 gerrit-lb.${DC}.wikimedia.org 29418 [17:21:17] :D [17:22:46] https://gerrit.wikimedia.org/r/1227391 [17:23:18] cdanis: https://gerrit.wikimedia.org/r/c/operations/puppet/+/859986 [17:23:29] ehehe I saw [17:23:31] and the bug [17:23:41] > Valentin and I clarified this is the first phase, the next step will be to remove port 80 / plain HTTP entirely later on. [17:23:44] we can call this done soon lol [17:23:48] :) [17:23:52] ✅ [17:28:30] any objections to me continuing with codfw? [17:33:10] nope [17:33:37] I'd done the opposite.. codfw first and eqiad later [17:34:04] hehe true [17:43:11] ok, we are live in codfw, so that's everywhere now :D [17:46:40] <3 [17:53:15] <3 [18:00:33] requesting review on https://gerrit.wikimedia.org/r/1227395 if anyone dares [18:04:22] * sukhe will risk it on a 35cm projected snowfall day [18:46:08] eek [18:50:05] sorry, just added a few more test cases too [18:50:29] are these doctests or just comments? [18:50:45] they are doctests! and I'm running them (by hand) with ./tunnelencabulator --self-test [18:50:50] ah nice [18:51:22] I am sure you know but for the dns repo, I added the doctests to tox.ini (captain obvious yes) [18:51:30] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/tox.ini#21 [18:52:20] yeah I don't think there's any CI in this repo at all heh [18:52:27] ah ok [18:52:31] looking again [18:53:17] replenerate_hostnames took me a while indeed [18:53:25] the word itself *repleneration* [18:53:37] 😅 [18:56:03] thanks! [18:57:49] gentle ping on this, I understood it was ok to merge, let me know if there's anything else I need: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1218817 [18:58:19] (doh, just realized I didn't add anyone to review - done) [18:58:20] milimetric: yes, I think we decided it was ok to merge during the meeting. let me get a +1 from v.g explicitly though? [18:58:25] I will take care of that [18:58:41] thx much, no rush, I know yall in a lot of stuff [18:59:12] no worries, left a comment [18:59:28] (I can +1 this since we discussed it but deferring to vg/fabu.r; fab is out) [19:31:04] 06Traffic, 10DNS, 06serviceops, 06SRE, and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11526184 (10ssingh) This is typically done as part of a new wiki creation process, but Traffic is happy to help as required. [19:56:15] 06Traffic, 10DNS, 06serviceops, 06SRE, and 2 others: Set up DNS for abstract.wikipedia.org to be recognised - https://phabricator.wikimedia.org/T411724#11526339 (10Jdforrester-WMF) >>! 
In T411724#11526184, @ssingh wrote: > This is typically done as part of a new wiki creation process, but Traffic is happy...