[03:03:57] 06Traffic, 06SRE: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10585757 (10Grand-Duc) I just tested uploading a photo of 17,6MB, and the effect (getting "Service Temporarily Unavailable Our servers are currently under maintenance or ex...
[03:18:55] 06Traffic, 06SRE: Reproducible blocking error using the basic upload form, no upload possible - https://phabricator.wikimedia.org/T387007#10585780 (10Grand-Duc) FYI, my test subject was this image: https://commons.wikimedia.org/wiki/File:Englischer_Garten_Meiningen,_Gruftkapelle_-_2020-04-29_HBP.jpg The actual...
[09:35:39] 06Traffic: Remove katran blockers for low-traffic non-k8s based services - https://phabricator.wikimedia.org/T373020#10586303 (10Vgutierrez)
[10:50:11] 06Traffic, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10586474 (10cmooney) I don't think it should matter to have the same setting for all interfaces on the box. As I understand it we can...
[11:05:41] Traffic team FYI our bandwidth from codfw to eqsin over the direct path (Arelion E-LINE transport) has been increased to 6Gb/sec now
[11:05:48] I've increased the shapers on our routers accordingly
[11:08:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[11:08:12] this should hopefully give us the headroom to not drop any packets on this path under normal usage levels
[11:11:25] topranks: <3 great news, thanks
[11:18:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[11:44:11] 06Traffic, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10586593 (10akosiaris) > Needs to have rp_filter off (0) or in "loose" mode (2) as pods want to send packets from the service VIP, whi...
[11:57:34] 06Traffic, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956#10586605 (10cmooney) >>! In T352956#10586593, @akosiaris wrote: > This isn't true. Pods do not see the service VIP ever. Traffic reach...
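[Editor's note: the rp_filter values debated in T352956 above are per-interface kernel sysctls (0 = off, 1 = strict, 2 = loose reverse-path filtering). A minimal sketch of the "loose" setting mentioned in the ticket; the file name is hypothetical and this is illustration, not the actual puppetized config:]

```ini
# /etc/sysctl.d/60-rp-filter.conf (hypothetical file name)
# 2 = "loose" mode: accept a packet if a route back to its source exists
# via *any* interface, not necessarily the one it arrived on. This lets
# realservers emit packets sourced from a service VIP without strict-mode drops.
net.ipv4.conf.all.rp_filter = 2
net.ipv4.conf.default.rp_filter = 2
```

[Note that the kernel applies the maximum of the `all` and per-interface values, which is why the ticket discusses whether one setting for all interfaces is acceptable.]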
[13:34:26] 06Traffic, 10Data-Engineering (Q3 2024 January 1st - March 31th), 10DPE HAProxy Migration: [HAProxy migration] Some 200 requests in VK are logged as 400 in HAProxy - https://phabricator.wikimedia.org/T387451 (10JAllemandou) 03NEW
[13:38:23] 06Traffic, 06Data-Engineering: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454 (10JAllemandou) 03NEW
[13:59:40] 10Wikimedia-Apache-configuration, 06serviceops, 06SRE, 10Wikimedia-Portals, and 2 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10586951 (10Gehel)
[14:05:36] 06Traffic, 06Data-Engineering: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10586980 (10Fabfur)
[14:52:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[15:07:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[15:12:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[15:22:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[15:33:45] 10Wikimedia-Apache-configuration, 06serviceops, 06SRE, 10Wikimedia-Portals, and 2 others: www.wikipedia.org: prefilling the search box with the "search" URL parameter does not work - https://phabricator.wikimedia.org/T318285#10587407 (10Pcoombe) 05Open→03Resolved a:03simon04 `search` is working a...
[16:07:54] 06Traffic, 06Data-Engineering, 10DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10587663 (10Ahoelzl)
[16:08:07] 06Traffic, 06Data-Engineering, 10DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10587664 (10Fabfur) Given that it's just 4 bytes more, I think we can add this (I would do after we complete the migration, given that is a change on how we manage t...
[16:36:00] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-codfw, 06SRE: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10587904 (10cmooney)
[18:14:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[18:34:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[18:38:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[18:48:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[19:29:52] hey traffic folks, we are using Varnish to cache mediawiki like the wmf do. I have created https://issue-tracker.miraheze.org/T13282 for us to do.
I believe the WMF used to or still does use the url to determine what cp you hit. Am I correct and is there docs on how you do it?
[19:34:17] RhinosF1: currently, we don't use hostnames or URLs to pick caches for increased hitrate.
[19:34:39] RhinosF1: what cp server you hit depends on the IP address and where LVS decides to send you based on the hash
[19:35:03] RhinosF1: our older design was basically: we have L4LB (e.g. pybal, or any other TCP-connection-level balancer) hash traffic into a set of varnish nodes based on hashing the client IP (so the same IP sticks to the same varnish cache).
[19:35:44] and then we had a second tier of caching behind those varnishes, with larger disk storage, and we hashed on the url to pick which backend-layer cache to use if the frontend (smaller, memory-only) varnish cache was a miss.
[19:36:22] the frontend and backend cache layers were all running on the same underlying hosts.
[19:36:26] What was doing the hashing on the url to pick which backend-layer cache to use?
[19:37:29] varnish code in the frontend was, using the varnish director that can hash on arbitrary strings to make a backend selection (in our case, the URL)
[19:37:42] varnish. and it still does, on non single-backend sites (only codfw and drmrs)
[19:38:16] oh true, that is still live on two sites, as we transition
[19:38:22] bblack: do you know where in the wmf's code I could find the bit of your varnish config that does that?
[19:38:25] it's the "shard" director we use to do that
[19:38:39] also why did you decide to do that
[19:39:06] To not do that *
[19:39:09] And migrate away
[19:39:24] RhinosF1: templates/wikimedia-frontend.vcl.erb, line 497
[19:39:29] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/wikimedia-frontend.vcl.erb#495
[19:39:56] our varnish puppetization is crazily complicated, sorry :)
[19:40:54] That's fine
[19:41:08] RhinosF1: for what bblack was describing above, the full history is in https://phabricator.wikimedia.org/T288106
[19:41:16] "experiment with single backend CDN nodes"
[19:41:20] the history, how, why, etc
[19:41:22] not all requests are cacheable, and this in general creates scaling problems in varnish in the face of random internet traffic.
[19:41:43] so you get into the whole hit-for-pass vs pass thing, esp if only the application layer knows whether some requests are cacheable.
[19:42:47] even in the case where hit-for-pass is working correctly, if we naively just sharded all the traffic towards the backend layer by URL, some uncacheable URLs are wildly popular and will focus all their traffic on a single backend pointlessly, from the whole varnish cluster.
[19:43:43] so that's why we also have backend_random in that puppetization. We try to use a random director when we know it's pass-traffic, instead of the shard director we use when we think it might be cacheable. both point at the same set of backend servers.
[19:44:14] That makes sense
[19:45:25] but even given that: there still remain cases where you get undesirable focus of uncacheable reqs all into one backend node. We had some killer examples of this e.g. in cache_upload (which does the commons media traffic), just due to file size limits at the various layers, etc... and someone would hotlink a certain commons image of a certain size in some wildly-popular phone app somewhere, and
[19:45:31] all the traffic would hit one backend and melt it.
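[Editor's note: the two-tier scheme described above (client-IP hashing at the L4 balancer, URL hashing from the frontend cache to the backend cache) can be sketched in a few lines of Python. Node names and the plain modulo hash are hypothetical stand-ins; the real setup uses pybal/IPVS and varnish's shard director, which do consistent hashing.]

```python
import hashlib

# Hypothetical node names; per the discussion, the frontend (memory) and
# backend (disk) caches run on the same underlying hosts.
NODES = ["cp1", "cp2", "cp3", "cp4"]

def pick(nodes, key):
    # Stable hash-based selection. Plain modulo hashing for brevity; the
    # production directors use consistent hashing so that removing one
    # node does not reshuffle every key.
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def route(client_ip, url):
    frontend = pick(NODES, client_ip)  # L4 LB: same client IP -> same frontend
    backend = pick(NODES, url)         # frontend miss: same URL -> same backend
    return frontend, backend
```

[The effect is that each client sticks to one frontend's hot in-memory cache, while each URL lives on exactly one backend's disk, multiplying effective backend storage by the node count.]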
[19:46:11] also, when external traffic patterns cause such a problem, it tends to be self-reinforcing and escalate to a broader outage
[19:47:04] first those requests all focus into the backend on Node123 and effectively kill it, then it fails the varnish-frontend healthchecks, so they remove it from consideration as a backend and re-point the next batch of these requests at the next one in line. It just basically runs down the list of backend nodes and laser-focus-destroys them one by one :P
[19:47:45] so we stopped hashing on URL entirely in our newer design
[19:48:19] now each cache node still has a frontend cache and a backend cache on it, but a frontend just sends all traffic to its node-local backend cache.
[19:48:53] I think a cascading outage for us is less of a risk anyway
[19:48:54] the reason we hashed by-url in the first place was to effectively multiply our storage size at the backend layer. If you have 8 nodes with 1TB of disk in each and hash on URL, you get a collective 8TB of effective cache space.
[19:49:25] so we just increased the disk sizes we were willing to buy, and stopped hashing on URLs. Basically we traded $$ for fewer operational risks/problems.
[19:49:29] Because we have so few cache proxies that if one goes offline, it's going to be noticeable anyway
[19:49:44] Simply by the sudden extra load they'll take
[19:49:48] yeah
[19:50:11] I think given we're limited by a smaller cache size, it's better to make best use of it
[19:50:14] you may not really even need two-tier storage, depending on traffic volume and wiki dataset size
[19:50:37] Cause if you've got 1TB of cache but traffic goes randomly, you really only have 500GB
[19:50:57] Because we've got 2 cache proxies
[19:51:14] two-tier storage itself was a problem. We used to have both layers as varnish.
But varnish's "file" storage is non-persistent across reboot/restart, and varnish's old "mmap" backend for persistence was allowed to bitrot when the commercial Varnish guys invented the commercial-only MSE backend storage engine.
[19:51:33] If one goes down, the other struggling is kinda less of a worry because it's just not that redundant
[19:51:33] so now we use Apache Traffic Server for the backend disk cache layer
[19:51:51] Yeah we probably don't need two tier storage
[19:52:02] We just need the ability to direct traffic at the first tier
[19:52:46] Well technically we have two tier caching anyway because we use Cloudflare for security and that's doing some caching of very basic static assets (but nothing from mediawiki itself because it didn't work very well)
[19:52:50] well, if you want to have a single tier of varnishes, and have something hash-by-url into them to multiply your storage size, then that something has to be an L7 balancer/proxy that can see and hash on the URL.
[19:53:12] e.g. haproxy perhaps
[19:54:23] Haproxy could work if it can do that
[19:54:43] We have haproxy as an option
[20:00:43] Thank you bblack sukhe, that was very useful
[20:00:59] np!
[20:03:21] RhinosF1: np! and yes, haproxy can do that at least last we looked and given it's a much better experience than varnish, we even considered replacing varnish with haproxy for caching. of course it's not possible for us right now (happy to discuss that) but it's a _thought_
[20:03:44] there isn't a task right now because brett is making good progress on Varnish 7 upgrade so we are sticking to it but there might be at some stage
[20:05:22] sukhe: I may come back with more questions
[20:05:37] I'll discuss what was said so far with my experts on that side of stuff
[20:05:56] cool!
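[Editor's note: the single-tier idea floated above (an L7 balancer hashing on the URL into a set of varnishes) maps onto haproxy's `balance uri` directive. A minimal sketch; the backend name, server names, and addresses are made up, and this illustrates the directives rather than a vetted production config:]

```haproxy
backend varnish_caches
    balance uri              # hash on the request URI to pick a server
    hash-type consistent     # minimize reshuffling when a server goes down
    server cp1 10.0.0.1:80 check
    server cp2 10.0.0.2:80 check
```

[With consistent hashing, losing one of the two servers re-homes only that server's share of URLs, rather than reshuffling the whole keyspace.]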
[20:40:15] 06Traffic, 06Data-Engineering, 10DPE HAProxy Migration: Add HAproxy termination field to webrequest - https://phabricator.wikimedia.org/T387454#10588919 (10Fabfur)
[21:49:09] FIRING: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX
[22:04:09] RESOLVED: LVSHighRX: Excessive RX traffic on lvs2013:9100 (eno12399np0) - https://bit.ly/wmf-lvsrx - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=lvs2013 - https://alerts.wikimedia.org/?q=alertname%3DLVSHighRX