[04:08:44] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647 (Papaul) NEW
[04:09:30] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11695826 (Papaul) p:Triage→High a:cmooney→ayounsi
[06:11:20] netops, Infrastructure-Foundations, Observability-Logging: ~5k/logs/sec from netdev - https://phabricator.wikimedia.org/T412143#11695955 (ayounsi) From JTAC: > I hope you are doing well. Our engineering team has found a fix for this behavior. However, the release you are running, 22.2, is already EoL...
[06:16:00] netops, Infrastructure-Foundations, ops-magru, SRE: cr2-magru <-> asw1-b3-magru link down March 2026 - https://phabricator.wikimedia.org/T418978#11695964 (ayounsi) Awesome, thx!!
[07:28:53] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696068 (ayounsi) Using this opportunity to test my WIP rack depool cookbook (only in "show" mode). More info in {T327300} That's the current status of what...
[07:43:02] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696088 (ops-monitoring-bot) Draining ganeti1033.eqiad.wmnet of running VMs
[07:43:40] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696089 (MoritzMuehlenhoff)
[08:20:09] sukhe: yeah, connecting from China, that's unrelated though.
[08:21:16] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696205 (MatthewVernon)
[08:22:24] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696207 (MatthewVernon) Can I check this is 15:00 UTC (particularly given daylight confusion...), please? Once it's done I'll check ms-be1091 [the frontends c...
[08:26:28] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696225 (MoritzMuehlenhoff)
[08:26:30] The SNI wikimedia-dns.org seems to be already blocked in China, can't say it's targeted though, I guess the GFW does active probes of /dns-query to any SNI.
[08:30:10] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696229 (ayounsi) >>! In T419647#11696205, @MatthewVernon wrote: > Can I check this is 15:00 UTC (particularly given daylight confusion...), please? Once it's...
[08:31:09] https://github.com/net4people/bbs/issues/68
[08:52:32] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696297 (MatthewVernon) Ah, I just put `10:00 EST` into `date`. You're probably right, but a confirmation would be helpful :)
[09:18:43] What's the best place to discuss requests to the bot-traffic mailing list? I'm asking because Tom Brewe seems to be waiting for a response still, and I'd like to tell them... something? Namely, I'm not seeing any significant traffic from their app. Turnilo shows 4 hits (times 128 sample rate) from 2 IPs. That shouldn't trigger rate limiting... But maybe some of you have better info?
[09:19:20] Can I somehow see how many requests got rate limited for a given user agent (or ideally, substring)?
[09:23:44] Traffic: Wikimedia Commons: incorrect 429 responses for thumbnail errors - https://phabricator.wikimedia.org/T419663#11696376 (taavi)
[09:32:58] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696410 (MoritzMuehlenhoff)
[10:06:16] Wikimedia-Apache-configuration, ServiceOps new, SRE, Wikibase GraphQL, and 2 others: Create a rewrite for the GraphQL endpoint on wikidata.org - https://phabricator.wikimedia.org/T417026#11696531 (Clement_Goubert) Ack, thanks for following up.
[10:16:14] netops, Infrastructure-Foundations: Nokia: implement maintenance mode - https://phabricator.wikimedia.org/T419673 (ayounsi) NEW p:Triage→Medium
[10:17:48] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696558 (taavi)
[10:23:06] netops, Infrastructure-Foundations, Prod-Kubernetes, ServiceOps new, SRE: Eqiad: lsw1-d7-eqiad BGP maintenance - https://phabricator.wikimedia.org/T418772#11696564 (ayounsi) Open→Resolved All done.
[10:30:55] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696597 (BTullis)
[10:31:35] netops, Infrastructure-Foundations, ServiceOps new, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11696605 (BTullis)
[10:34:04] duesen: you can see that in turnilo filtering per user agent and status
[10:46:19] netops, Infrastructure-Foundations, ServiceOps new, SRE: Nokia SR-Linux DHCP Relay Bug - https://phabricator.wikimedia.org/T411054#11696665 (BTullis) Will all of the switches in rows C & D be getting this configuration change? I'm asking because I've got another host that is exhibiting a reimage...
[10:57:25] vgutierrez: yes, but only at 1/128 sample
[10:58:41] duesen: that's true
[10:59:00] which rules are being triggered?
[10:59:15] you should have that data in turnilo on X-requestctl
[11:00:11] none... i don't see a single 429 there.
[11:00:19] let me zoom out to more days
[11:19:51] FIRING: SLOMetricAbsent: varnish-combined ulsfo - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:20:16] FIRING: SLOMetricAbsent: varnish-combined ulsfo - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:20:46] FIRING: SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:25:46] FIRING: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:39:51] Acme-chief, Traffic, Upstream: acme-chief is unable to validate challenges against GTS staging environment - https://phabricator.wikimedia.org/T419352#11696983 (Vgutierrez) Open→Resolved a:Vgutierrez
[11:40:16] RESOLVED: SLOMetricAbsent: varnish-combined ulsfo - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:40:46] FIRING: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:44:51] RESOLVED: SLOMetricAbsent: varnish-combined ulsfo - https://slo.wikimedia.org/?search=varnish-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:45:46] RESOLVED: [2x] SLOMetricAbsent: haproxy-combined - https://slo.wikimedia.org/?search=haproxy-combined - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[12:13:57] vgutierrez: I was unsure if we did use the regex_remap plugin in the end configuration or just this https://docs.trafficserver.apache.org/en/latest/admin-guide/files/remap.config.en.html#regular-expression-regex-remap-support
[12:19:04] afaict we are using regex_map and not map http://a.com http://b.com @plugin=regex_remap.so @pparam=maps.reg so https://gerrit.wikimedia.org/r/c/operations/puppet/+/1245389/comment/97d45ac2_eec177ca/ wouldn't work as is?
[12:19:15] Or am I completely misreading the doc?
[12:41:56] yeah.. you would need to load regex_remap.so
[12:42:03] but that should be OK
[12:43:07] so it would be a `map` rather than `regex_remap`
[12:43:10] yep
[12:43:36] or you can use the simpler and more verbose approach of the three map entries
[12:43:41] it's not the end of the world IMHO
[12:43:47] And it also makes it weird for gateway-check.lua because it expects the non-remapped host
[12:43:58] I feel like the 3 map entries is more legible
[12:44:05] ack
[12:44:19] If it's ok with you wrt config size/perf/etc. I think I'll go that way
[12:44:51] Unless the weird device-analytics path can't be deprecated, in which case I'll have to find something else (but it looks like it gets literally 0 traffic)
[13:01:46] ack
[13:30:19] Acme-chief, Traffic, Upstream: acme-chief is unable to validate challenges against GTS staging environment - https://phabricator.wikimedia.org/T419352#11697352 (ssingh) Thanks for identifying and fixing this, @Vgutierrez!
[14:19:37] Traffic: Revisit HAProxy cpu-map directive usage - https://phabricator.wikimedia.org/T419568#11697654 (BBlack) Reviewing the current situation, using haproxy-3.0 documentation as a guide: Our config of `nbthread` and `cpu-map` is driven by these snippets of the ERB template (irrelevant parts elided): ` <%...
[14:29:31] Traffic: Revisit HAProxy cpu-map directive usage - https://phabricator.wikimedia.org/T419568#11697837 (BBlack) Separately, drifting off into "thoughts for the future when we have time to pursue it": none of the options we have here have been rigorously compared under adversarial conditions. Ideally we'd pro...
[14:29:45] Traffic, Patch-For-Review: x-provenance header: identify WMCS - https://phabricator.wikimedia.org/T411503#11697839 (HCoplin-WMF) Just chiming in that we are seeing a few more scenarios where it would be helpful to differentiate WMCS traffic from all "internal"/known traffic. This is an example where it w...
[14:31:46] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11697857 (ssingh) Hi folks. I confirmed with Valentin that we don't need the public IPs, `pybal-high-traffic1-ulsfo.wikimedia.org` and `pybal-high-...
[14:39:36] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11697929 (ssingh) Sorry, @ayounsi reminded me that the main purpose of this task is to figure out what to do about the other public IPs. We will ne...
[14:41:31] Traffic, Fundraising-Backlog, Fundraising-Tech-Roadmap, Wikimedia-Fundraising-CiviCRM, fr-acoustic: Acoustic SMS: Domain needed for short links - https://phabricator.wikimedia.org/T379318#11697939 (ssingh) Thanks for confirming folks. No action required on this then.
[14:45:07] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11697954 (ssingh) @BCornwall: DC-Ops has recommended in the past to try rebooting the server again to see if the issue resolves. I am not saying it is the same but perhaps ca...
[14:46:53] Cuthead: yes, indeed, where you are connecting from should not change anything
[14:47:16] we identified the issue: what happened was (we think) that we rolled out the IPv6 change for the auth nameservers
[14:47:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1250576 should fix it but on the wikidough side, we need to dig a bit deeper on how the ECS is propagated
[14:47:45] so we will work on that, thanks for bringing it to our notice
[15:02:28] Not sure which issue that fixes...?
[15:03:36] The weirdest behavior for me is that there's a possibility of ECS either failing or working.
[15:04:56] Cuthead: so in Wikimedia DNS, we only send ECS to queries destined for our own nameservers
[15:05:04] and right now we just have the v4s in there
[15:05:56] so there's that fix, where we will add the v6 addresses, so we send edns-client-subnet for queries destined for v6 records as well
[15:06:20] and then there is the more complicated question of how dnsdist forwards ECS to pdns-rec and further to our nameservers
[15:06:29] that requires a bit more research
[15:13:27] Doesn't make sense to me. You can have an IPv6 subnet in ECS option, regardless of v4 or v6 stack.
[15:15:27] And vice versa, having an IPv4 ECS subnet in an IPv6 packet is also possible.
[15:16:37] Cuthead: yes, you can. what is probably broken now (but like I said, it needs further confirmation) is this bit in pdns-rec.conf, which receives queries from dnsdist
[15:16:40] edns-subnet-allow-list=208.80.154.238, 208.80.153.231, 198.35.27.27
[15:18:40] basically, we just added IPv6 addrs to our wikimedia.org NS records, and our public resolver is probably reaching our internal authdns over IPv6, and we've only allow-listed it to forward ECS data to the IPv4 authserver IPs.
[15:19:24] Ok, I understand it now.
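Editor's sketch of the theory above, in Python. This is a simplified model and not the recursor's actual code: it assumes ECS is forwarded only when the destination authserver address matches an entry in `edns-subnet-allow-list`. The three IPv4 entries are the ones quoted in the log; the IPv6 address is hypothetical (documentation prefix), and the function and constant names are invented for illustration.

```python
import ipaddress

# IPv4 entries quoted in the log from pdns-rec.conf (edns-subnet-allow-list).
ALLOW_LIST_V4_ONLY = ["208.80.154.238", "208.80.153.231", "198.35.27.27"]

# Hypothetical IPv6 authserver address (2001:db8::/32 is a documentation
# prefix, not a real Wikimedia address).
HYPOTHETICAL_AUTH_V6 = "2001:db8::53"


def ecs_forwarded(destination, allow_list):
    """Return True if `destination` matches any allow-list entry.

    Assumption: a destination that matches no entry gets its query
    sent without the client subnet attached.
    """
    dest = ipaddress.ip_address(destination)
    for entry in allow_list:
        # strict=False lets a bare host address act as a /32 or /128 network.
        network = ipaddress.ip_network(entry, strict=False)
        if dest.version == network.version and dest in network:
            return True
    return False


if __name__ == "__main__":
    # Reaching an authserver over IPv4: ECS is forwarded.
    print(ecs_forwarded("208.80.154.238", ALLOW_LIST_V4_ONLY))
    # Reaching the same service over IPv6 with a v4-only list: ECS is dropped.
    print(ecs_forwarded(HYPOTHETICAL_AUTH_V6, ALLOW_LIST_V4_ONLY))
    # After adding the v6 address to the list: ECS is forwarded again.
    print(ecs_forwarded(HYPOTHETICAL_AUTH_V6,
                        ALLOW_LIST_V4_ONLY + [HYPOTHETICAL_AUTH_V6]))
```

Under that assumption, the address family of the subnet inside the ECS option does not matter here; what matters is which authserver address the resolver happens to pick, which would also account for ECS working on some queries and not others.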
[15:19:56] the reason this theory may work is because well, we do have a test suite of sorts that was running fine before the IPv6 change but now it is broken as well
[15:20:05] and given that that change is recent (last week), it seems to be the culprit
[15:20:17] unless there is further stuff broken down the chain, we will see when we roll it out and test that
[15:51:28] netops, DC-Ops, Infrastructure-Foundations, ops-ulsfo, and 2 others: ULSFO: Update ULSFO LVS service IP's - https://phabricator.wikimedia.org/T418971#11698383 (ssingh) @Jgreen / @Dwisehaupt: `donate-lb.ulsfo.wikimedia.org` is the same IP as `text-lb.ulsfo.wikimedia.org` and that will change as pa...
[16:06:59] vgutierrez: I added a small paragraph about regex_map to the ATS wikitech page so that it's clearer for future me what can and can't be done with it :P
[16:24:22] netops, Infrastructure-Foundations, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11698537 (RLazarus) Service Ops triage here: Agreed there's nothing for us to do, thanks @ayounsi - untagging us.
[16:33:11] claime: <3
[16:36:00] FIRING: AnycastHealthcheckerRestarted: anycast-healthchecker service restarted on durum4003:9100 - https://wikitech.wikimedia.org/wiki/Anycast#Anycast_healthchecker_not_running - https://grafana.wikimedia.org/d/dxbfeGDZk/anycast?orgId=1&var-protocol=BGP&var-site=ulsfo&var-cluster=All&var-ip_version=All - https://alerts.wikimedia.org/?q=alertname%3DAnycastHealthcheckerRestarted
[16:39:58] FIRING: [2x] SystemdUnitFailed: anycast-healthchecker.service on durum4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:41:59] checking what's up
[16:42:13] reimaged yesterday
[17:17:49] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11698841 (RobH) a:RobH
[17:18:50] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11698846 (RobH) Dell will require all the firmware and such be the latest versions before they call it a failure, so I'll steal this and update the firmware on this host and...
[17:27:18] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11698888 (RobH)
[18:57:06] Cuthead: dig @185.71.138.138 +subnet=103.102.166.224/24 +https dyna.wikimedia.org A +short
[18:57:09] 103.102.166.224
[18:57:18] it should be fine now, so please try it and let us know
[18:57:43] it was the v6 addresses missing from edns-subnet-allow-list
[19:01:07] thanks for reporting btw. in theory, we do have https://gitlab.wikimedia.org/repos/sre/knead-wikidough to prevent this but it needs to run more often
[19:01:48] specifically, we have tests to check this and also to ensure that we don't send ECS to non-wikimedia auth servers (https://gitlab.wikimedia.org/repos/sre/knead-wikidough/-/blob/main/tests/test_dns.py?ref_type=heads#L230)
[19:03:45] Yes, confirmed to be working now.
[19:07:16] ok nice!
[19:09:18] Traffic: Decommission codfw cp hosts - https://phabricator.wikimedia.org/T419753 (BCornwall) NEW
[19:09:33] Traffic: Decommission codfw cp hosts - https://phabricator.wikimedia.org/T419753#11699404 (BCornwall) Open→In progress p:Triage→Medium
[19:19:43] Traffic: Decommission codfw cp hosts - https://phabricator.wikimedia.org/T419753#11699424 (ssingh) Thanks Brett. @Fabfur cp2041 and cp2042 are yours for the OpenSSL testing.
[19:30:26] Traffic: Ensure periodic and automatic runs of knead-wikidough - https://phabricator.wikimedia.org/T419754 (ssingh) NEW
[19:30:48] Traffic: Ensure periodic and automatic runs of knead-wikidough - https://phabricator.wikimedia.org/T419754#11699449 (ssingh) p:Triage→Medium
[19:34:25] netops, Infrastructure-Foundations, SRE: Eqiad: lsw1-d2-eqiad BGP maintenance - https://phabricator.wikimedia.org/T419647#11699470 (MoritzMuehlenhoff)
[19:59:53] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11699554 (BCornwall)
[21:08:31] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11699926 (BCornwall)
[21:41:52] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700050 (RobH) I had some ISP issues with upload speeds to magru, so Papaul helped me out and flashed the firmware for idrac, bios, and backplane. The error persists, so I'...
[22:47:55] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700241 (RobH) a:RobH→BCornwall @BCornwall, After firmware updates and resetting the SEL and rebooting the issue now seems to have cleared up. The collection log...
[22:48:28] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700244 (RobH)
[22:48:39] Traffic, DC-Ops, ops-magru: hw troubleshooting: Comm Error: Backplane 0 for cp7012 - https://phabricator.wikimedia.org/T419611#11700247 (RobH)
[23:57:07] Traffic: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832#11700366 (BCornwall)