[00:43:08] netops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#10818970 (Papaul) @Jgreen @Dwisehaupt When do you think is best for me to work on this? Thank you.
[00:44:10] netops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#10818971 (Papaul) p:Triage→Medium
[01:43:49] FYI VSV16 should not affect Wikimedia traffic, so long as HAProxy is in layer 7 mode instead of layer 4.
[01:44:33] to exploit it you need an upstream reverse proxy that will send bogus chunked encoding, but I’m pretty sure that HAProxy always sends correct chunked encoding
[01:45:20] specifically, HAProxy dechunks client data and then rechunks it (with correct delimiters)
[01:46:22] What would be useful is to set the option to not forward trailers once Wikimedia is using HAProxy 3.2. Varnish can't handle trailers at all due to a bug in Varnish.
[05:12:21] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Possible mgmt switch down in eqiad row D - https://phabricator.wikimedia.org/T394258 (Marostegui) NEW
[05:12:50] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: Possible mgmt switch down in eqiad row D - https://phabricator.wikimedia.org/T394258#10819974 (Marostegui)
[06:01:14] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad, SRE: Possible mgmt switch down in eqiad row D - https://phabricator.wikimedia.org/T394258#10820177 (ayounsi) →Duplicate dup:T394109
[07:19:51] fabfur: hello! I'm planning on depooling esams today to upgrade its routers, is now a good time ?
[07:34:24] I think it's good now
[07:34:34] do you need any help with that
[07:35:11] fabfur: thx, just to have someone from traffic around
[07:36:17] ack
[07:43:08] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10820417 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=239f1d24-394b-4cd2-b80b-211b30b54a1a) set by ayounsi@cumin1002 for 1:00:00 on 3 host(s) and their servic...
[08:14:04] rebooting cr2-esams (no impact expected except some alerting noise)
[08:15:43] Traffic, collaboration-services, Release-Engineering-Team (Radar): Separate Gerrit https and ssh/git hostnames - https://phabricator.wikimedia.org/T394271 (Jelto) NEW
[08:18:03] ack
[08:19:32] we have a very high throughput in drmrs but that's expected I think
[08:33:40] fabfur: yeah, I'm keeping an eye on https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?orgId=1&from=now-3h&to=now&timezone=utc&var-site=drmrs&var-instance=cr1-drmrs:9804&refresh=1m
[08:33:54] arelion is running a bit hot, but so far so good
[08:34:13] cr2-esams is done, going to start looking at cr1-esams, this one have 2 REs so it will take a bit longer
[08:34:13] ok coffee time then
[08:43:18] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10820629 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0ccf059a-76d1-46d7-9ee7-b67d79c235aa) set by ayounsi@cumin1002 for 1:00:00 on 1 host(s) and their servic...
[08:43:41] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10820631 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ed684b09-6354-460a-9fbf-3df20fbe3f21) set by ayounsi@cumin1002 for 1:00:00 on 2 host(s) and their servic...
[09:51:30] netops, Infrastructure-Foundations, SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10820938 (ayounsi)
[10:07:19] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10820991 (cmooney)
[10:08:27] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Stage and configure new Juniper switches in codfw rows E/F - https://phabricator.wikimedia.org/T394021#10820999 (cmooney)
[10:25:04] Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821027 (cmooney) @akosiaris is there any update on this one? If I recall correctly from our discussion at the SRE Summit the curr...
[10:30:06] Traffic: Move method check from varnish to HAProxy - https://phabricator.wikimedia.org/T392073#10821064 (Fabfur) Stalled→Resolved
[10:30:31] topranks: regarding that comment about increasing the MTU on k8s hosts.. that would also require increasing the MTU on the LVS as well, right?
[10:30:37] Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821067 (cmooney)
[10:30:42] Traffic, Liberica, Patch-For-Review: Replace current L4LB with with Katran-based alternative - https://phabricator.wikimedia.org/T332027#10821069 (cmooney)
[10:31:05] Traffic, Liberica, Patch-For-Review: Replace current L4LB with with Katran-based alternative - https://phabricator.wikimedia.org/T332027#10821075 (cmooney)
[10:34:13] Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821085 (Vgutierrez) @cmooney that also implies increasing MTU on the LVS host as well, right?
[10:36:09] Traffic: Remove katran blockers for low-traffic non-k8s based services - https://phabricator.wikimedia.org/T373020#10821101 (cmooney)
[10:41:29] Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821122 (cmooney)
[10:41:36] Traffic, Liberica, Patch-For-Review: Replace current L4LB with with Katran-based alternative - https://phabricator.wikimedia.org/T332027#10821124 (cmooney)
[10:48:11] Hey! I have a question before we rollout PCS from restbase to rest gateway for zhwiki. I have an endpoint that with the right `accept-language` it gives me the output with the expected variant. Both responses have `accept-language` in `vary`.
[10:49:13] Will an edge PURGE for eg `zh.wikipedia.org/api/rest_v1/page/mobile-html/` purge all entries for all different `accept-language` variants?
[10:54:44] <vgutierrez> nemo-yiannis: yes, at least for varnish.
I'm struggling to find hard evidence for ATS at the moment
[10:54:57] * vgutierrez sitting on a hospital waiting room at the moment
[11:00:18] <vgutierrez> nemo-yiannis: but apparently ATS could require several requests (one per accept-language)
[11:04:29] <wikibugs> Traffic, Liberica, Patch-For-Review: Replace current L4LB with with Katran-based alternative - https://phabricator.wikimedia.org/T332027#10821195 (cmooney)
[11:57:02] <wikibugs> Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821401 (cmooney) >>! In T352956#10821085, @Vgutierrez wrote: > @cmooney that also implies increasing MTU on the LVS host as well,...
[12:03:46] <wikibugs> Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821441 (JMeybohm) >>! In T352956#10821401, @cmooney wrote: >>>! In T352956#10821085, @Vgutierrez wrote: >> @cmooney that also impl...
[12:07:38] <wikibugs> Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821470 (Vgutierrez) That would be enough to accommodate IPv4 and IPv6? We currently clamp at 1440 bytes for ipv4 and at 1400 bytes...
[12:22:03] <wikibugs> Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821513 (Vgutierrez) Nevermind, we only do ipv4 for low-traffic/internal services
[13:12:53] <wikibugs> Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821714 (cmooney) >>!
In T352956#10821470, @Vgutierrez wrote: > That would be enough to accommodate IPv4 and IPv6? We currently cla...
[13:23:11] <wikibugs> netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10821795 (ayounsi) Opened JTAC case 2025-0514-696857 for the management switches (EX4300)
[13:31:45] <wikibugs> netops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#10821832 (Jgreen) >>! In T393996#10818970, @Papaul wrote: > @Jgreen @Dwisehaupt When do you think is best for me to work on this? Thank you. We have a frack maintenance week starting...
[13:33:09] <wikibugs> netops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#10821835 (cmooney) >>! In T393996#10821832, @Jgreen wrote: >>>! In T393996#10818970, @Papaul wrote: >> @Jgreen @Dwisehaupt When do you think is best for me to work on this? Thank you....
[13:46:41] <wikibugs> Traffic, Prod-Kubernetes, serviceops, Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10821894 (akosiaris) >>! In T352956#10821027, @cmooney wrote: > @akosiaris is there any update on this one? > > If I recall correct...
[14:33:18] <wikibugs> Traffic, SRE: Lower geodns TTLs for dyna.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312 (ssingh) NEW
[14:49:56] <wikibugs> netops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#10822351 (Papaul) @ayounsi @cmooney siice i am out that week can someone take over this or wait when i am back .
thanks
[15:00:26] <wikibugs> Traffic, DC-Ops, ops-codfw, SRE: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968#10822441 (Jhancock.wm) Open→Resolved @BCornwall the alert has cleared in the idrac and I dont't see anything new in the history since yesterday. We mi...
[15:01:40] <wikibugs> netops, Infrastructure-Foundations: Downgrade pfw1-codfw to Junos 23.4R2-S3 - https://phabricator.wikimedia.org/T393996#10822465 (cmooney) a:Papaul→cmooney >>! In T393996#10822351, @Papaul wrote: > @ayounsi @cmooney siice i am out that week can someone take over this or wait when i am back . tha...
[15:12:40] <fabfur> going to remove varnishkafka from magru (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145948), eventual alerts are on me
[15:23:09] <vgutierrez> plz !log it
[15:23:50] <vgutierrez> nemo-yiannis: regarding your question from this morning, you're ok, PURGE /url will take care of wiping all variants/alternates of the URL
[15:24:07] <vgutierrez> thx for the question BTW, it helped me discovering an issue with an ongoing project :D
[15:24:17] <nemo-yiannis> cool, thanks for the information!
[15:49:51] <sukhe> cdanis: from the pope discussion the other day https://phabricator.wikimedia.org/T394312 :)
[15:50:00] <sukhe> (any feedback, please add to the task)
[15:50:15] <sukhe> XioNoX: topranks: ^ for your awareness; let us know if there is any feedback
[15:58:58] <topranks> I think it should be ok
[15:59:24] <wikibugs> Traffic, collaboration-services, Release-Engineering-Team (Radar): Separate Gerrit https and ssh/git hostnames - https://phabricator.wikimedia.org/T394271#10822942 (bd808) > a new hostname like gerrit-git.wikimedia.org (tbd) for SSH/Git. Naming is always tricky, but I wonder if putting `ssh` in the...
[16:05:55] <wikibugs> Traffic, collaboration-services, Release-Engineering-Team (Radar): Separate Gerrit https and ssh/git hostnames - https://phabricator.wikimedia.org/T394271#10822985 (bd808) > Implementing this change would require a lot of refactoring across various tools and automation in CI, Puppet, and local reposi...
[16:15:48] <wikibugs> Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, fr-acoustic: Acoustic SMS: Domain needed for short links - https://phabricator.wikimedia.org/T379318#10823034 (ssingh) Hi @greg: (Using this task is perfectly fine, thanks). The delegation for wiki.gives will be done at Markmonitor's (regist...
[16:18:04] <wikibugs> Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, fr-acoustic: Acoustic SMS: Domain needed for short links - https://phabricator.wikimedia.org/T379318#10823058 (greg) Makes sense, thank you @ssingh ! Just to be clear in my mind: Are the markmonitor changes done by Traffic or Legal/Trademark...
[16:42:57] <wikibugs> Traffic, DC-Ops, ops-codfw, SRE: hw troubleshooting: Memory failure for cp2029.codfw.wmnet - https://phabricator.wikimedia.org/T393968#10823168 (BCornwall) Thank you!
[17:04:54] <wikibugs> Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, fr-acoustic: Acoustic SMS: Domain needed for short links - https://phabricator.wikimedia.org/T379318#10823265 (ssingh) @greg: The change has been made and Markmonitor has been updated. `NS` records have a fairly large TTL of 86400 seconds or...
[17:13:19] <wikibugs> Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, fr-acoustic: Acoustic SMS: Domain needed for short links - https://phabricator.wikimedia.org/T379318#10823326 (greg) Awesome, thank you very much @ssingh ! I'll let the Acoustic side know now.
[19:57:38] <inflatador> sukhe ebernhardson ryankemper just a heads-up that I'm gonna start merging the patches from yesterday , starting with https://gerrit.wikimedia.org/r/c/operations/dns/+/1145276/1
[19:58:29] <ryankemper> inflatador: ack, there's some small merge conflicts fyi
[19:58:58] <inflatador> cool, let me work on that first
[20:01:28] <sukhe> inflatador: fwiw I have to step out for a bit soon
[20:01:39] <sukhe> but it should be OK
[20:17:59] <inflatador> so far so good. Running dns.netbox cookbook now
[20:20:47] <sukhe> nice
[20:24:29] <inflatador> OK, pooling w/confctl
[20:26:20] <inflatador> running authdns-update...
[20:27:13] <inflatador> and...failure. Damn
[20:27:19] <sukhe> sigh
[20:27:29] <sukhe> what is it this time?
[20:27:35] <inflatador> `Invalid resource name 'disc-search-psi-https' detected from zonefile lookup`
[20:27:46] <inflatador> I have a tmux open on dns1004 if you wanna sudo to me and have a look
[20:28:36] <brett> again?
[20:28:47] <ebernhardson> hmm, that one is in services.yaml, and has been for some time. It should be available
[20:29:42] <inflatador> do I need to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143622/7 before running authdns-update?
[20:30:29] <ebernhardson> inflatador: no i don't think so, that one should be after everything else. Although i could be misunderstanding and those dnsdisc: entries are critical earlier?
[20:31:02] <sukhe> inflatador: yeah, merge that
[20:31:09] <inflatador> ACK, merging now
[20:31:50] <sukhe> let me know when done; I want to run authdns-update
[20:31:57] <sukhe> er, run-puppet-agent on A:dnsbox
[20:32:37] <inflatador> OK, puppet-merge is finished
[20:32:53] <sukhe> ok trying
[20:33:22] <sukhe> I think if this doesn't work, let's set up some dedicated time for this but I am still missing a lot of context even though your patches have been helpful
[20:34:04] <inflatador> Sounds good, I am off the rest of the week but we can hit it next week or whenever.
It's not a blocker for the current OpenSearch migration
[20:35:46] <sukhe> ok failure to reload but that's fine
[20:35:52] <sukhe> state files created
[20:36:01] <sukhe> waiting for it to finish and will run authdns-update again
[20:36:43] <inflatador> 🫰
[20:36:53] <sukhe> nope
[20:38:27] <sukhe> inflatador: yeah we are missing something and it's time to hit the drawing board. please find some time on my calendar for when you are back
[20:38:31] <sukhe> let's revert
[20:39:02] <inflatador> ACK, rolling back patches now
[20:39:05] <sukhe> thanks
[20:42:29] <sukhe> did you revert the puppet one too and merged?
[20:42:36] <inflatador> not yet, I had a question about the conftool stuff
[20:43:13] <sukhe> that should not affect anything else I feel
[20:43:21] <sukhe> as long as it just exists in etcd, it's not a big deal
[20:43:26] <inflatador> maybe it's already covered in the docs, but I ran commands like `confctl --object-type discovery select 'dnsdisc=search' set/pooled=true` , I'm guessing I need to do something before removing anything?
[20:43:41] <inflatador> in that case I won't worry about it. reverting puppet stuff now
[20:44:31] <sukhe> I am not 100% sure but I think as long as we can get it to a working state, we should be fine.
[20:44:47] <sukhe> as in, non-broken state and working before patch was merged
[20:44:51] <inflatador> except it sounds like I don't need to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1145278/2 ? I don't mind either way, just don't want to set off alerts
[20:45:08] <sukhe> yes I think you can leave this one
[20:45:27] <inflatador> cool, I am gonna revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143622/8 though
[20:45:32] <sukhe> yes this
[20:46:15] <sukhe> I am not even sure why this didn't work. it created the state files as well, clearly.
[20:48:11] <inflatador> OK, that's reverted and I just puppet-merged it
[20:48:14] <sukhe> thank you
[20:48:18] <inflatador> NP, thanks for standing by
[20:48:28] <sukhe> running agent on A:dnsbox
[20:49:14] <sukhe> most likely it is an ordering issue we are getting wrong.
[20:49:34] <sukhe> as in the order of merging the changes. there are various warnings along the way about this and clearly I missed them
[20:49:52] <inflatador> Could be. I tried following the order of the patches + the linked docs, but entirely possible I missed something
[20:52:48] <sukhe> ok gdnsd is fine at least
[20:52:53] <sukhe> rolling out
[20:58:02] <sukhe> inflatador: anything missing that you can see?
[20:58:08] <sukhe> dns hosts are OK, gdnsd is fine
[20:58:13] <sukhe> but I might be missing something
[20:58:46] <inflatador> I think everything's OK. We're not using discovery at all for search yet
[20:59:29] <sukhe> ebernhardson: to answer your earlier question, per https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service, first the service definition goes in and then at the very end authdns-update
[20:59:59] <sukhe> so I am wondering if we got the order wrong there (and the order matters)
[21:00:21] <sukhe> essentially in this case, we had the gdnsd failure *first* and that makes sense in hindsight
[21:01:38] <ebernhardson> sukhe: ok, i suppose that makes sense for the discdns: entries
[21:01:56] <sukhe> though I am not sure why the service is failing now and if it is related :)
[21:05:51] <wikibugs> netops, Infrastructure-Foundations, Observability-Alerting, Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10824298 (ayounsi) No luck: ` Thank you for the information provided. As I have verified on the device and in Pathfinder - Fea...
[21:27:34] <inflatador> Per https://phabricator.wikimedia.org/T143553 , we're going to hold up these changes until after the OpenSearch migration .
That should give us some time to really focus