[03:49:32] 06Traffic, 10MobileFrontend, 10Data-Engineering (Q3 2025 January 1st - March 31th): Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924 (10Krinkle) 03NEW [03:50:01] 06Traffic, 10MobileFrontend, 10Data-Engineering (Q3 2025 January 1st - March 31th): Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924#10706593 (10Krinkle) [07:16:32] vgutierrez: ask and ya shall receive. See https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1133745. I already have a manually patched eqiad deployment though [07:16:45] So, curl --http1.1 --parallel-max 5 -Z -s -o /dev/null -v --connect-to www.wikifunctions.org:443:mw-wikifunctions-ingress.discovery.wmnet:30443 [07:16:45] 'https://www.wikifunctions.org/w/api.php?action=query&format=json&list=wikilambdaload_zobjects&wikilambdaload_zids=Z1%7CZ2%7CZ12%7CZ11%7CZ3%7CZ4%7CZ6%7CZ8%7CZ7%7CZ9%7CZ40%7CZ41%7CZ42%7CZ14%7CZ1002%7CZ881%7CZ18%7CZ60%7CZ1001%7CZ1003%7CZ1004%7CZ1005%7CZ1672%7CZ1645&wikilambdaload_language=en&wikilambdaload_get_dependencies=true&vgutierrez=[1-200]' |& [07:16:45] grep -E '404|server' [07:17:00] returns indeed the names of the pods that serve the request [07:17:04] also, 0 404s [07:17:53] awesome [07:18:10] -H 'Connection: close' if you wanna force a new connection per request BTW [07:18:37] also.. I don't know if it's a feature or a bug but that ingress has HTTP/2 enabled [07:21:58] I'm reapplying the patch on cp3066 [07:23:04] 06Traffic, 10MobileFrontend: MobileFrontend should declare "X-Subdomain" variance via "Vary" response header - https://phabricator.wikimedia.org/T390929 (10Krinkle) 03NEW [07:27:30] 06Traffic, 10MobileFrontend: MobileFrontend should declare "X-Subdomain" variance via "Vary" response header - https://phabricator.wikimedia.org/T390929#10706730 (10Krinkle) @Joe @BBlack What if any precautions or planning should we take into account with this? Is this low-risk to "just" add and ride the trai... 
[07:28:55] all ingresses do apparently. It's by default on? We can disable it globally I guess, one less thing to have to debug [07:29:04] plus, what the point of HTTP/2 in the DC [07:29:15] all the goodies are for end users anyway [07:29:29] and there's some horror stories about HTTP/2 in the DC too [07:31:30] ATS got it disabled currently for outbound connections [07:31:55] ah, I was about to point to https://istio.io/latest/docs/ops/common-problems/network-issues/#404-errors-occur-when-multiple-gateways-configured-with-same-tls-certificate which could be an explanation [07:32:00] I'm unable to reproduce the issue at the moment with cp3066 pooled BTW [07:32:04] but if ATS doesn't have HTTP/2 then no point [07:32:20] should I give it a try too? [07:32:36] well.. ATS definitely reuses connections [07:32:40] even if it's HTTP/1.1 [07:32:58] just reproduced it using siege [07:33:17] you should be able to see it in atslog-backend [07:33:40] but it's very weird that I can't reproduce it when using siege against the ingress directly and not via ATS [07:34:07] 16/200 requests resulted in a 404 [07:34:55] so you have several services behind that ingress? [07:35:04] yes [07:35:12] many up to now [07:35:18] this is the first time we see this [07:35:25] hmmm [07:35:33] can you point me to the services behind that ingress? [07:35:47] * vgutierrez reverting cp3066 state [07:36:50] vgutierrez: https://phabricator.wikimedia.org/P74589 [07:37:23] this one is the only one that also has a secondary filtering mechanism that chooses between 3 different backends [07:37:30] but the 3 backends are identical right now [07:37:49] the idea being that they will eventually be MWs group0, group1 and group2 [07:39:47] ok.. 
[07:39:51] I've found the issue [07:40:01] * akosiaris prepares himself [07:40:31] pparam=proxy.config.http.server_session_sharing.match=ip [07:40:53] you're telling ATS that's ok to share connections to the same IP for wikifunctions.org [07:41:20] so if ATS has a connection opened to another service behind that ingress like miscweb [07:41:23] is gonna try to use it [07:42:01] ok, that explains why when we depooled the host we no longer could reproduce it [07:42:04] if we have several services behind the same IP we need to ditch that [07:42:30] it's an interesting race condition. [07:42:40] how much of an optimization is that? [07:42:53] as in, if we had to do this for all of MediaWiki traffic, how bad would it be? [07:43:01] pretty bad [07:43:20] given that en.wikipedia.org, en.m.wikipedia.org, es.wikipedia.org and es.m.wikipedia.org would require different connections [07:43:57] and would all require for TLS to be negotiated anew [07:44:05] indeed [07:44:21] TLS 1.3 is pretty fast in that sense but still [07:44:42] wikifunctions is ok, mw-web and mw-api isn't [07:44:58] I think it's worth the effort of having two dedicated IPs for those two [07:46:05] we can probably move all of the mw deployments behind a dedicated IP anyway and have the setting on for those [07:46:14] s/those/that IP/ [07:46:43] 06Traffic, 13Patch-For-Review: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912#10706775 (10MoritzMuehlenhoff) JFTR, this also fixes **CVE-2024-53868 - Chunked message body allows request smuggling** https://www.openwall.com/lists/oss-security/2025/04/02/4 https://github.com/apache/trafficse... 
[07:47:14] akosiaris: that's ok if the ratio of 5xx between services is similar [07:47:56] good point, I'll note it down to look into that [07:48:01] for example, we got a dedicated entry for wikidata.org where we disable session reuse per IP because the amount of 5xx for wikidata is way higher than let's say en.wp [07:48:21] and that harmed connection reuse [07:49:04] yeah looking at the comment in the configuration right now [07:49:06] interesting [07:51:02] anyway, I wasn't planning on rolling this out to all wikis right now, but later. I'll make notes and put them in a phab task [07:51:25] for now, a patch for wikifunctions without ip sharing should do it [07:58:01] vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133812 [07:59:40] thanks for all the help btw. [08:00:18] 06Traffic, 06Abstract Wikipedia team, 06serviceops, 13Patch-For-Review, 07Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10706790 (10akosiaris) The issue was found. It's effectively a race condition. We figured out tha... 
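A toy model of the connection-reuse problem described above, as a sketch only. It assumes, as a simplification, that the ingress routes every request on a connection according to the SNI that connection was opened with. `SessionPool`, `ingress_status`, `INGRESS_IP` and `miscweb.example` are hypothetical names for illustration; only the `match=ip` setting comes from the log, and the `both` mode here stands in for any pool key that also includes the hostname.

```python
from dataclasses import dataclass

INGRESS_IP = "10.0.0.1"  # hypothetical shared ingress IP


@dataclass
class Connection:
    ip: str
    sni: str  # hostname the TLS session was negotiated for


def ingress_status(conn: Connection, host: str) -> int:
    """Assumed ingress behaviour: route by the connection's SNI.

    A request whose Host differs from that SNI finds no route: 404.
    """
    return 200 if conn.sni == host else 404


class SessionPool:
    """Very small stand-in for an origin-side connection pool."""

    def __init__(self, match: str) -> None:
        self.match = match  # "ip" or "both" (IP and hostname)
        self.idle = {}

    def _key(self, ip: str, host: str):
        # With match=ip any idle connection to the same IP is a candidate.
        return ip if self.match == "ip" else (ip, host)

    def request(self, ip: str, host: str) -> int:
        key = self._key(ip, host)
        conn = self.idle.get(key)
        if conn is None:
            # No reusable session: open a new one with SNI = host.
            conn = Connection(ip=ip, sni=host)
            self.idle[key] = conn
        return ingress_status(conn, host)


def replay(match: str) -> list:
    """Send one miscweb request, then one wikifunctions request."""
    pool = SessionPool(match)
    hosts = ["miscweb.example", "www.wikifunctions.org"]
    return [pool.request(INGRESS_IP, h) for h in hosts]


if __name__ == "__main__":
    print("match=ip  ", replay("ip"))
    print("match=both", replay("both"))
```

In this model the second request only fails when an idle connection for another service already exists, which fits the intermittent 404s and the fact that a freshly depooled host did not reproduce them.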
[08:04:29] akosiaris: no problem [08:30:49] 06Traffic: haproxykafka minor features - https://phabricator.wikimedia.org/T374128#10706861 (10Fabfur) [08:31:16] 06Traffic: haproxykafka minor features - https://phabricator.wikimedia.org/T374128#10706865 (10Fabfur) [08:31:18] 06Traffic, 13Patch-For-Review: Enable SSL client authentication on haproxykafka - https://phabricator.wikimedia.org/T379776#10706864 (10Fabfur) [08:31:22] 06Traffic, 06Data-Engineering, 06Data-Engineering-Radar, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review: New software: haproxykafka - https://phabricator.wikimedia.org/T370668#10706866 (10Fabfur) [08:31:29] 06Traffic: haproxykafka minor features - https://phabricator.wikimedia.org/T374128#10706869 (10Fabfur) [08:31:30] 06Traffic, 10Sustainability (Incident Followup): Avoid logging errors per produced message - https://phabricator.wikimedia.org/T380583#10706868 (10Fabfur) [08:32:08] 06Traffic, 06Data-Engineering, 10DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10706872 (10Fabfur) [08:32:10] 06Traffic, 06Data-Engineering, 06Data-Engineering-Radar, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review: New software: haproxykafka - https://phabricator.wikimedia.org/T370668#10706873 (10Fabfur) [08:32:26] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 10DPE HAProxy Migration: Make webrequest_frontend being ingested using the in-data `dt` field - https://phabricator.wikimedia.org/T388397#10706874 (10Fabfur) [08:32:32] 06Traffic, 06Data-Engineering, 06Data-Engineering-Radar, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review: New software: haproxykafka - https://phabricator.wikimedia.org/T370668#10706875 (10Fabfur) [08:33:05] 06Traffic: Split MessageBuffer configuration for different processing channels - https://phabricator.wikimedia.org/T386801#10706876 (10Fabfur) [08:33:06] 06Traffic: haproxykafka minor features - 
https://phabricator.wikimedia.org/T374128#10706877 (10Fabfur) [08:33:09] 06Traffic, 06Data-Engineering, 06Data-Engineering-Radar, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review: New software: haproxykafka - https://phabricator.wikimedia.org/T370668#10706878 (10Fabfur) [08:42:39] 06Traffic, 10Liberica: Provide NTP healthchecks - https://phabricator.wikimedia.org/T389212#10706898 (10Vgutierrez) 05Open→03In progress p:05Triage→03Medium [08:43:35] 06Traffic, 06Abstract Wikipedia team, 06serviceops, 13Patch-For-Review, 07Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10706902 (10akosiaris) 05Open→03Resolved a:03akosiaris I 'll resolve this. The fix has... [10:01:00] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp4047:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=ulsfo%20prometheus/ops&var-instance=cp4047 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue [10:02:09] ^^ downtimed for 15d [10:13:02] thx [10:25:01] 10netops, 06Infrastructure-Foundations, 06SRE: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10707237 (10cmooney) [10:25:05] 10netops, 06Infrastructure-Foundations, 10Observability-Alerting, 13Patch-For-Review: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641#10707238 (10cmooney) [10:35:03] 10netops, 06Infrastructure-Foundations, 06SRE: Create alerting for saturation on sub-rated interfaces - https://phabricator.wikimedia.org/T374614#10707267 (10cmooney) >>! In T374614#10147994, @ayounsi wrote: > Short term I think if you add `[4Gbps]` to the interface description, LibreNMS will [[ https://docs... 
[10:45:50] 10netops, 06Infrastructure-Foundations, 10Observability-Alerting, 06SRE: Improve port-utilisation alerting to take QoS into account - https://phabricator.wikimedia.org/T384052#10707299 (10cmooney) [10:52:08] 06Traffic: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227#10707327 (10Fabfur) [11:17:03] Hiii, I'm trying to add verified domains to HIBP. There are two ways, email to postmaster@domain or a TXT record. I did TXT record for most of non-dyna ones but for majority of them, it's better not to add a txt record to dyna. OTOH, I think the mail routing/MX is not working. I sent an email to postmaster@de.wikipedia.org. *One day* later I got bounce response with this: [11:17:03] > The recipient server did not accept our requests to connect. For more information, go to https://support.google.com/mail/answer/7720 [de.wikipedia.org 2620:0:861:ed1a::1: FAILED_PRECONDITION: connect error (111): Connection refused] [de.wikipedia.org 208.80.154.224: FAILED_PRECONDITION: connect error (111): Connection refused] [11:17:03] (208.80.154.224 is text-lb in eqiad) [11:17:03] What do you recommend me to do? [11:18:27] we could set MX record for dyna to mx-in100x? I have no idea how much of flood it might cause 😅 [11:57:35] Amir1: hmm [11:58:15] we have to be careful playing around with dyna [11:59:01] give me some time to come online and then let's talk about it? [12:01:34] yeah sure! [12:01:40] not urgent anymore [12:10:44] 06Traffic, 10MobileFrontend, 10Data-Engineering (Q3 2025 January 1st - March 31th): Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924#10707573 (10phuedx) > Other ideas? Preferences? IIRC Varnish is the decision maker in production – MobileFrontend simply responds to th... 
[12:45:31] 10netops, 06Infrastructure-Foundations, 06SRE: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10707705 (10cmooney) [13:22:16] on the list of domains: Most large projects with their mobile domains. On non wikis, I wrote some in https://phabricator.wikimedia.org/P74552 but I will add more [13:26:07] this may be a dumb question but why do they need to verify ownership of specific subdomains? [13:28:19] see lang editions are CNAMES to dyna, I see no way of doing this without having dyna return TXT records as well. and since I don't know of any way in gdnsd to do that, that would mean we return those records by default to all dyna lookups, which is not ideal of course [13:28:26] s/see/since [13:33:33] [note that we cannot, by definition, return additional records if a CNAME is there. I was thinking of modifying dyna in such a way that we return separate A and TXT records, but that has the same complication as above] [13:39:20] bblack is out this week so if you want to wait, we should run it by him [13:45:03] 06Traffic, 06Abstract Wikipedia team, 06serviceops, 07Wikimedia-production-error: Partial mw-wikifunctions outage; 404s on load.php and others? - https://phabricator.wikimedia.org/T390854#10707998 (10Jdforrester-WMF) >>! In T390854#10706902, @akosiaris wrote: > I 'll resolve this. The fix has worked, t... [14:03:47] sukhe: to answer "why do we need to verify each sub-domain separately" That is a very valid question that has been frustrating me a lot. Here is also other people complaining about it https://haveibeenpwned.uservoice.com/forums/275398-general/suggestions/9421500-allow-root-domain-to-verify-subdomains [14:04:11] i.e. HIBP doesn't allow it [14:04:21] :( [14:04:47] yeah... 
[14:05:27] Amir1: the problem is that unless we can actually announce the MX records (which we can't, at least I don't see a way right now), you can't even do email verification [14:05:37] 06Traffic, 10Liberica: Provide NTP healthchecks - https://phabricator.wikimedia.org/T389212#10708099 (10Vgutierrez) per IRC conversations with @ssingh we should validate the stratum as well [14:05:56] if you remember, these are some of the same concerns that we mentioned in the doc on the DC experiment [14:06:44] that because of how we route users on dyna, we really can't for example return different IPs on lang or project [14:08:00] I mean, we can add the mx record to dyna? the same way we do with wikimedia.org root [14:08:05] https://www.irccloud.com/pastebin/DpDkSdJd/ [14:08:24] but I have no idea whether mx-in is set up to take the email flood [14:09:00] but maybe I'm misunderstanding why we can't set a mx record for dyna [14:09:12] regardless, it's quite risky even if it's possible [14:09:44] Amir1: that's an MX record for @, which is different for setting an MX for dyna. [14:10:55] dyna is defined as: dyna 300 IN DYNA geoip!text-addrs [14:11:48] and then all other lang editions are CNAMEs to this, so it can't be set on those and not on dyna itself. [14:15:03] 06Traffic, 10MobileFrontend, 10Data-Engineering (Q3 2025 January 1st - March 31th): Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924#10708135 (10Jdforrester-WMF) If this code is in MobileFrontend, it won't detect mobile hits to non-MF wikis like Wikifunctions, so that... 
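The CNAME constraint mentioned above can be sketched as a small check, assuming the usual rule that a name owning a CNAME cannot carry other record types. `cname_conflicts` and the record tuples are hypothetical, and the sketch ignores DNSSEC record types, which are allowed next to a CNAME.

```python
def cname_conflicts(records):
    """Return names that own a CNAME together with any other record type.

    `records` is an iterable of (name, rtype) pairs; this is a toy zone
    representation, not gdnsd's zone file format.
    """
    by_name = {}
    for name, rtype in records:
        by_name.setdefault(name, set()).add(rtype)
    return sorted(
        name
        for name, types in by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Language editions are CNAMEs to dyna, so an MX (or TXT) on "de" conflicts.
zone = [("dyna", "DYNA"), ("de", "CNAME"), ("es", "CNAME")]
with_mx = zone + [("de", "MX")]

if __name__ == "__main__":
    print(cname_conflicts(zone))
    print(cname_conflicts(with_mx))
```

This is why the extra record would have to live on dyna itself, where every lookup that follows the CNAME would see it.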
[14:40:07] Amir1: sorry for making it seem like I am being difficult :* [14:40:20] but I was wondering if there might be an alternate path [14:40:24] I would never think that, I know you well enough <3 [14:40:47] the concern I simply have is overloading dyna like this, which we really have not done [14:41:05] and then because it serves such a critical path of lookups, I am hesitant about breaking stuff in some way [14:41:26] and since bblac.k is not around, I am also not sure what he thinks about this [14:41:40] (I am basing some of this on my previous discussions with him but of course I can be wrong) [14:41:59] we have an alternative way: what if we do an alias in gsuite somehow and direct mail delivery to you. would that work? [14:42:26] I mean, we already publish MX records for both wikipedia and wikimedia.org and in theory these are subdomains of that, so I don't see why not [14:44:32] I don't know how we can set that up. It should probably go to noc@ instead of my account but yeah. If we can set up something like that somehow, it'd be great [14:44:57] right now, it can't even find a mail server for those domains [14:47:56] 06Traffic, 06Data-Engineering, 10MobileFrontend: Add ismobile attribute to X-Analytics header - https://phabricator.wikimedia.org/T390924#10708340 (10Ahoelzl) [14:49:57] Amir1: as I understand it (please double check :)), incoming mail aliases can be created via gsuite and then mail should first hit those before checking against the aliases defined further down below [14:50:27] what I don't know is if you can do subdomain based routing without domain verification (which again, depends on what kind of verification they want) [14:50:33] _or_ [14:50:46] we can wait for bblac.k to come back on Monday and do the dyna MX record [14:52:38] I'm not sure if we have the gsuite alias check for any domain besides wikimedia.org and wikipedia.org, maybe it would still work for subdomains though [14:52:52] let's wait until Monday, not urgent [14:54:29] ok 
[14:56:02] fyi, I am merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133932 and thus removing a service from LVS. I've already set it to no paging and it's very very very low traffic (max 3rps at peaks) [14:57:08] the part of the mesh/discovery has been updated already (and works fine, mostly due to the fact that it doesn't go via LVS anyway due to all k8s nodes having the IP and everything talking internally in that cluster) [14:57:27] 🤠 [15:20:44] 06Traffic, 06Data-Engineering, 07Essential-Work, 10Experimentation Lab Radar: Cookie % has been rejected because it is foreign and does not have the "Partitioned" attribute - https://phabricator.wikimedia.org/T375256#10708646 (10Ahoelzl) [15:21:06] akosiaris: are you taking care of the LBs restarts? [15:21:54] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th): Migrate Benthos `webrequest_sampled_live` to feed from HAProxy data - https://phabricator.wikimedia.org/T390029#10708662 (10Ahoelzl) 05Open→03Resolved [15:21:58] 06Traffic, 10Data-Engineering (Q3 2025 January 1st - March 31th), 10DPE HAProxy Migration: Make webrequest_frontend being ingested using the in-data `dt` field - https://phabricator.wikimedia.org/T388397#10708666 (10Ahoelzl) 05Open→03Resolved [15:22:55] 10netops, 06Infrastructure-Foundations, 10Data-Engineering (Q3 2025 January 1st - March 31th): Update `netflow` retention strategy in Druid (too much data) - https://phabricator.wikimedia.org/T387839#10708693 (10Ahoelzl) 05Open→03Resolved [15:23:21] 06Traffic, 06SRE, 10Data-Engineering (Q3 2025 January 1st - March 31th), 13Patch-For-Review: Refine add_is_wmf_domain TransformFunction fails if no source field exists - https://phabricator.wikimedia.org/T383914#10708700 (10Ahoelzl) 05Open→03Resolved [15:26:13] vgutierrez: yes [15:26:18] akosiaris: thx [17:26:19] vgutierrez: fabfur https://phabricator.wikimedia.org/T360589#10709525 thumbnail steps, does my analysis make sense here? 
once it's fully rolled out and caches warmed up, I think the hit-front ratio and overall hit ratio should go super high [17:27:45] Amir1: btw I summarized the challenges in https://gerrit.wikimedia.org/r/c/operations/dns/+/1133974/comments/45de6d6c_52ed89d6 [17:28:20] and suggested an alternate path; now bblac.k can have some context when he comes back and suggest [17:28:42] Thank you sukhe <3 [18:35:05] 06Traffic, 06Data-Engineering: Unable to save Jupyter Notebooks or start IPython kernel on stat1008 - https://phabricator.wikimedia.org/T390959#10709949 (10CDobbins) [19:56:56] 06Traffic, 06DC-Ops, 10ops-eqiad, 06SRE: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10710364 (10VRiley-WMF) Hey @Vgutierrez we have recieved the NIC. Is there a specific time for us to install it? [19:57:35] 06Traffic, 06DC-Ops, 10ops-eqiad, 06SRE: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10710373 (10VRiley-WMF) a:03VRiley-WMF [20:18:02] 06Traffic, 06SRE, 13Patch-For-Review: Create provisioning and post-provisioning checks for Traffic hosts to confirm validity of varying hardware configurations - https://phabricator.wikimedia.org/T378724#10710479 (10CDobbins) On 4/2, we discussed the merits and pitfalls of the proposed implementation with @V... [22:47:00] 06Traffic, 06Data-Persistence, 06SRE, 10SRE-swift-storage, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10710968 (10Jdlrobson-WMF) @Ladsgroup let me know if and how I can help with this, but untagging web team.