[01:31:57] (EdgeTrafficDrop) firing: 64% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[01:36:57] (EdgeTrafficDrop) resolved: 66% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[01:41:57] (EdgeTrafficDrop) firing: 57% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[01:51:57] (EdgeTrafficDrop) resolved: 67% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=drmrs&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[04:41:09] 10Traffic, 10Data-Engineering, 10SRE, 10Trust-and-Safety, 10serviceops: Disable GeoIP Legacy Download - https://phabricator.wikimedia.org/T303464 (10odimitrijevic)
[08:45:57] (EdgeTrafficDrop) firing: 56% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[08:47:52] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster
[08:51:42] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster exe...
[08:54:49] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster
[09:15:57] (EdgeTrafficDrop) resolved: 65% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[09:30:47] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp5006&service=Confd+vcl+based+reload known issue?
[09:33:54] also this https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=phab.wmfusercontent.org&service=HTTPS-wmfusercontent ?
[09:34:33] nevermind, looks like my icinga window had stalled data :)
[09:36:47] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1079.eqiad.wmnet with OS buster com...
[10:02:57] (EdgeTrafficDrop) firing: 27% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[10:08:05] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1081.eqiad.wmnet with OS buster
[10:12:57] (EdgeTrafficDrop) resolved: 0% request drop in text@eqiad during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=eqiad&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DEdgeTrafficDrop
[10:33:27] XioNoX: are you aware of why BFD sessions over IPv6 are down to doh/durum VMs in drmrs?
[10:33:58] just had a quick look, nothing jumping out at me, wanted to make sure it wasn't a known thing or the result of some other work before digging further
[10:43:28] topranks: yeah, I spent some time on it and it's quite the rabbit hole :) I need to document my findings, but as it's not a blocker I got side-tracked on other things
[10:43:55] ok no probs
[10:44:25] I'll wait to see what you've already discovered instead of repeating that effort.
[10:44:39] Did strike me as odd that it affected different machines across different switches, and only v6.
[10:45:12] topranks: current theory is that the switch is trying to do multi-hop BFD while the server does single-hop, and the link-local vs. unicast IP is making it worse
[10:46:36] with https://phabricator.wikimedia.org/T209989#5025258 on top of it, it makes for many different variables to understand how each side plays out
[10:47:18] topranks: but as BFD never came up, BGP ignores it and stays up
[10:48:00] ah ok that was really making me scratch my head, didn't know that was possible.
[10:48:36] yeah theory makes some sense. Looking at a PCAP, the hop-count is 255 on packets sent from either side
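An aside (not from the log): one quick way to test the multi-hop-vs-single-hop theory is to classify the BFD control packets in a capture by UDP destination port. RFC 5881 single-hop BFD uses port 3784 and RFC 5883 multihop BFD uses 4784, so if the two ends send to different ports they will never match each other's sessions. A minimal scapy sketch, assuming a capture file named bfd.pcap (hypothetical name); this is exactly the asymmetry the conversation goes on to find further down (switch to 4784, VM to 3784).

```python
#!/usr/bin/env python3
"""Classify BFD control packets in a capture by UDP destination port.

Single-hop BFD (RFC 5881) uses UDP dst port 3784; multihop BFD
(RFC 5883) uses 4784; BFD echo uses 3785. A mismatch between the two
ends of a session means neither side sees the other's packets.
"""
from collections import Counter

from scapy.all import IP, IPv6, UDP, rdpcap  # pip install scapy

BFD_PORTS = {3784: "single-hop", 4784: "multihop", 3785: "echo"}


def summarize(pcap_path: str) -> Counter:
    """Count packets per (source address, BFD mode) pair."""
    counts = Counter()
    for pkt in rdpcap(pcap_path):
        if UDP not in pkt or pkt[UDP].dport not in BFD_PORTS:
            continue
        ip_layer = pkt[IPv6] if IPv6 in pkt else pkt[IP] if IP in pkt else None
        if ip_layer is None:
            continue
        counts[(ip_layer.src, BFD_PORTS[pkt[UDP].dport])] += 1
    return counts


if __name__ == "__main__":
    # "bfd.pcap" is a hypothetical capture file taken on the server or switch side.
    for (src, mode), n in sorted(summarize("bfd.pcap").items()):
        print(f"{src} -> {mode}: {n} packets")
```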
[10:49:04] I was confused why the switch shows the remote (doh6xxx) side as being 'admin down'
[10:50:08] topranks: you can see verbose traceoptions in asw1-b12-drmrs> show log bfd.log
[10:51:25] for example: Mar 18 16:10:22 Packet from 2a02:ec80:600:101:10:136:0:21 to fe80::cafe:6a02:6d2d:3800 (ifl 574, rtbl 0), discr 0x0, label 0, matches no session
[10:51:34] yeah I was just looking at that
[10:51:56] Kind of looks like it doesn't like the link local
[10:52:11] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1081.eqiad.wmnet with OS buster com...
[10:52:34] XioNoX: good tip on bfd.log, thanks for that
[10:53:50] and asw1-b12-drmrs> show bfd session address 2a02:ec80:600:1:185:15:58:11 extensive
[10:53:55] Session type: Multi hop BFD
[10:54:10] lunch time, back later
[10:54:23] np, and sorry was trying to nerdsnipe you :)
[10:56:00] but yep, looking at the PCAP, the switch is sending to port 4784 (multihop BFD), the VM is sending to port 3784 (regular BFD)
[10:58:01] On the back of that I tried this:
[10:58:03] set protocols bgp group Anycast6 bfd-liveness-detection session-mode single-hop
[10:59:01] And sessions came up on asw1-b12
[11:01:02] What I don't understand is the relation between that and the issues in T209989, in terms of having a unified config
[11:01:03] T209989: Bird multihop BFD - https://phabricator.wikimedia.org/T209989
[11:01:26] I'd have thought that since in both cases there is direct L2 adjacency there was no multihop, but I'm clearly missing something.
[11:14:41] 10netops, 10Infrastructure-Foundations: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (10cmooney) p:05Triage→03Low
[11:15:43] XioNoX: FYI I created the above to document what I'd found so far
[11:21:31] Hello. I have a question about the new rows E and F. Each ToR switch has 8 x 100 Gbps ports and the rest are 25 Gbps, right?
[11:22:21] Is it easy to say how many of the 100 Gbps ports are accounted for by the topology and how many are intended to be available for equipment within the racks?
[11:25:53] 10netops, 10Infrastructure-Foundations: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (10cmooney) For comparisons sake the session from cr1-codfw to doh2001 are up, and using multi-hop mode. These are similar to drmrs in that one side is...
[11:26:40] btullis: In theory all the 100G ports would be for uplink to the next layer up, but we are unlikely to ever get to having 8 spine switches at that next layer tbh.
[11:27:12] The non-100G/QSFP ports support 25/10/1G. Thus far we've only deployed servers at 10G or 1G I believe.
[11:27:27] Although given the price point, if it were all up to me I'd go with 25G everywhere.
[11:27:45] We've not really had a request for 100G to end servers, but certainly that's something we could consider.
[11:28:04] The other option would be bundling 25G ports in a LAG, to make 50G or whatever. Is there a particular use case you had in mind?
[11:29:16] topranks: OK, thanks for that. It was the recent email exchange where there was talk about using a 100 Gbps port instead of a LAG of 25 Gbps ports that got me wondering.
[11:29:59] It's certainly possible. There are a few considerations, like the available upstream bandwidth.
[11:30:18] Currently we have 2x100G upstream from each of those switches
[11:30:36] So if we added a few 100G server ports, and expected those to transmit 100G regularly, we might need to add upstream capacity.
[11:31:25] I am currently thinking about a storage platform for Data as a Service, based on Ceph. I'm only at the stage of drawing diagrams to discuss with m'team, but it's useful to know this stuff at this stage.
[11:31:45] ok cool, yeah happy to discuss in more detail, it's certainly an option
[11:32:07] we'd need to work on the server spec too, in terms of what NIC we would use and making sure the PCIe etc. is going to be able to keep up
[11:33:26] Yeah, definitely. The 50 Gbps LAG is also an option. For reference, I'm looking at this kind of architecture: https://image.semiconductor.samsung.com/content/samsung/p6/semiconductor/newsroom/tech-blog/all-flash-nvme-reference-architecture-with-red-hat-ceph-storage-32/redhat-ceph-whitepaper-0521.pdf
[11:33:44] ...although not Redhat obvs. :-)
[11:35:22] yep... skimmed briefly, it seems to make sense; I don't think there would be any blockers, but obviously we need to plan it all out.
[11:35:38] In general I think the idea of performant nodes with fast storage, high bandwidth networking, makes sense for that
[11:35:49] better to save on power and space by having high density servers IMO
[11:36:09] how much does one of those cost?
[11:36:56] That I don't know tbh.
[11:42:00] <_joe_> btullis: I am generally skeptical of distributed filesystems, as they encourage bad programming practices - you access blobs from distributed storage using an API (the VFS one on Linux) that is designed for access to a local resource
[11:42:23] <_joe_> but, I'd discuss any plans for data storage with the data persistence team.
[11:42:49] <_joe_> this doesn't seem to be the right channel to involve them in any plans
[11:43:58] _joe_: Thanks. Yes you're right, I shouldn't go on about it here. I only came to ask about the network ports.
[11:44:19] <_joe_> btullis: yeah I was going one step backwards
[11:45:15] <_joe_> "data as a service" is a very broad term and I'm not sure the whole idea should not include the SRE team that has been designing how we store/access data until now. Unless with "data" we just mean analytics data in this context
[11:45:29] <_joe_> I've seen the terminology used either way
[11:46:43] <_joe_> but even then, I'd try to coordinate efforts, I kinda remember Emperor is looking into using ceph as a backend for swift too
[11:46:50] We already have a big distributed file system with HDFS. I'm just trying to think ahead and I'm not trying to do it without involving SRE. Only drawing sketches at the moment.
[11:54:33] <_joe_> btullis: oh sure, sorry I should've specified that I am skeptical in the context of serving a low-latency service like a user-facing website
[11:55:38] <_joe_> if you have higher tolerance for latency spikes, it's ok
[11:55:48] <_joe_> which is the case for analytics workloads
[12:11:28] topranks: I didn't know it was possible to force a BFD session to be single-hop, nice find!
[12:39:41] nice work on the ticket indeed, topranks :)
[12:42:44] master of the question-mark key right here :D
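An aside on the uplink-capacity point above (not from the log): a quick sanity check on whether 100G server ports would need extra upstream capacity is to compare the aggregate server-facing bandwidth against the 2x100G uplinks per ToR mentioned in the discussion. A back-of-the-envelope sketch; the port counts are illustrative assumptions, not the actual rack layout.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope oversubscription check for a ToR switch.

The 2x100G uplink figure comes from the discussion above; the access
port counts below are hypothetical examples, not the real row E/F layout.
"""


def oversubscription(ports_25g: int, ports_100g_servers: int, uplinks_100g: int = 2) -> float:
    """Worst-case ratio of server-facing bandwidth to uplink bandwidth."""
    server_gbps = ports_25g * 25 + ports_100g_servers * 100
    uplink_gbps = uplinks_100g * 100
    return server_gbps / uplink_gbps


if __name__ == "__main__":
    # Hypothetical scenarios: a rack of 25G servers, then the same rack
    # with a handful of 100G storage nodes swapped in.
    print(f"48x25G only:     {oversubscription(48, 0):.1f}:1")
    print(f"40x25G + 4x100G: {oversubscription(40, 4):.1f}:1")
```

In practice most ports never transmit at line rate simultaneously, which is the point made above: it only becomes a problem if the 100G nodes are expected to push 100G regularly.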
[13:16:42] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1082.eqiad.wmnet with OS buster
[14:00:36] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1082.eqiad.wmnet with OS buster com...
[14:19:58] 10netops, 10Infrastructure-Foundations, 10SRE: IPv6 BFD Sessions Failing from Bird (Anycast VMs) to Juniper QFX in drmrs - https://phabricator.wikimedia.org/T304501 (10cmooney) Thinking about this further I think it works from the CRs because the peering is from the local public/private subnet to the loopbac...
[14:44:49] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1080.eqiad.wmnet with OS buster
[14:49:23] bblack: 🚀
[14:50:56] :)
[14:51:09] bblack: not for now, but should we shorten the map list? eg.:
[14:51:09] PT => [drmrs, esams, eqiad, codfw, ulsfo, eqsin], # Portugal
[14:51:09] PT => [drmrs, esams, eqiad, codfw], # Portugal
[14:51:18] I uploaded ES and FR too, but I think I'll wait until later in the day when they're lower-traffic
[14:52:29] XioNoX: arguably, it makes sense to either cut the lists off once we cross both core sites, or cut it off at some arbitrary length like 4, which would often be the same. Needs a little thinking about scenarios and edge cases, though.
[14:55:08] (also, we've never given a ton of great thought to the deeper ordering of these lists anyways)
[14:55:20] we mostly focus on the first entry or two
[14:56:01] maybe as part of the latency stuff we'll look at next Q for drmrs anyway, we could see if there's an easy way to script up getting a realistic ordering for most of them.
[14:56:43] we could leverage our machine learning clusters
[14:56:45] :)
[14:58:34] :P
[15:28:58] 10Traffic, 10SRE, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1080.eqiad.wmnet with OS buster com...
[15:41:05] https://datatracker.ietf.org/doc/draft-ietf-ippm-responsiveness/
[15:41:14] ^ could be useful at some point!
[15:41:50] TL;DR is a measurement of relatively "real-world" network performance, under loaded multi-stream conditions, etc. There's a server API we could implement on our end to be a test endpoint for such measurements.
[15:42:55] it's mostly targeting the bufferbloat kinds of problems
[15:43:46] in the past, we've often tuned our server-side buffering upwards just to make things more resilient in edge cases. We've lacked a good way to measure the negative impacts of those excessive buffers at various layers of our infra to put pressure the other way in our config choices.
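An aside on the responsiveness draft mentioned above (not from the log): its headline metric expresses latency measured while the path is saturated as round-trips per minute (RPM), which is the kind of number that could put downward pressure on buffer sizes. The toy sketch below only shows that conversion; the draft itself also specifies how to generate the load and how to aggregate several probe types, and the RTT samples here are hypothetical.

```python
#!/usr/bin/env python3
"""Toy responsiveness calculation in the spirit of the IETF ippm
responsiveness draft: express latency measured *under load* as
round-trips per minute (RPM). Higher is better.

Simplified sketch only: the real methodology defines how to saturate
the path and which probes to aggregate; here we just convert a list of
under-load RTT samples into an RPM figure.
"""
from statistics import mean


def responsiveness_rpm(rtt_samples_s: list[float]) -> float:
    """Convert under-load RTT samples (seconds) to round-trips per minute."""
    if not rtt_samples_s:
        raise ValueError("need at least one RTT sample")
    return 60.0 / mean(rtt_samples_s)


if __name__ == "__main__":
    # Hypothetical samples: ~40 ms idle RTT ballooning to ~400 ms once
    # queues fill up -- the classic bufferbloat signature.
    idle = [0.040, 0.042, 0.039]
    loaded = [0.380, 0.410, 0.450]
    print(f"idle:   {responsiveness_rpm(idle):6.0f} RPM")
    print(f"loaded: {responsiveness_rpm(loaded):6.0f} RPM")
```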
[15:44:58] ring buffers I'm less worried about, but some of our L4-7 stuff probably has a bit more buffer bloat than is ideal :)
[16:33:22] I could be reading this wrong, but are they suggesting that a maxed-out/at-capacity network is "normal working conditions"?
[16:35:19] yeah but they're targeting the edges (the origin servers + client home networks) more than they are the middle networks and routers, I think.
[16:36:44] in those cases, that kind of is what matters a lot of the time. The sort of scenario where the ISP advertises 1Gbps and unloaded latency tests look great, but then you've got two TVs streaming movies while someone's playing a playstation and you start a video call and it's choppy, even though you haven't really even capped out all of your bandwidth but you're still suffering.
[16:37:13] it's all about traffic priorities and queueing and buffers, and mostly at the far ends these days I think.
[16:37:19] hmm yeah, I guess from us to end users it might be a good strategy.
[16:37:37] I was kind of considering between endpoints in our own DC, and thinking "we don't want to saturate all those links for a test"
[16:37:37] on our server end we're also contributing to that stuff, for our traffic
[16:39:12] we've got many thousands of conns coming into 1x CP servers for instance (or through 1x LVS), and there's buffering there too.
[16:39:49] we've tended to pay more attention to the "increase buffering so we don't drop stuff under short-term spikes of traffic or short-term pauses of service daemons" sort of logic.
[16:39:50] oh yeah, and various queues in the NICs and kernel etc. that could be tuned up/down presumably
[16:40:04] and not so much to the "reduce buffering and accept a little loss for better overall performance for users" angle
[16:40:38] the buffering happens all over, when you look at the total path of our typical user traffic to the caches
[16:40:59] and like how much do we buffer to keep TCP happy, rather than let TCP do its "job"
[16:41:13] there is obviously an optimal point somewhere in between
[16:41:32] even if we just look at the lvs+cache hosts: there's kernel/nic buffering in both, and then especially on the cache nodes, we've got basic TCP buffering to reach haproxy, etc...
[16:42:01] but beyond that, when we revproxy so many times (haproxy->varnish->ats), each revproxy has some level of effective buffering as well at L7
[16:42:51] even when we're "streaming" reqs and responses, there's always some smaller-scale store-and-forward effect going on at each layer of the stack.
[16:44:07] yeah it's a complex problem alright
[16:44:23] topranks: yeah that's the main thing. there's an optimal-ish area in the middle, and we don't really have good metrics or the ability to find it, historically. We've added buffering when it seems to help with what it helps with, but never tried to trim back in the other direction :)
[16:46:22] I've been there myself. Adding buffer can dramatically improve performance sometimes, and it looks like a great 'fix' for some really poor performance
[16:46:53] But you maybe end up slowing things down a little for everyone, which maybe nobody screams about, but you never go back and review
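An aside (not from the log): before reasoning about the proxy-level buffers in haproxy/varnish/ats, it's cheap to snapshot the kernel-level knobs mentioned above. A minimal sketch that reads a few standard Linux sysctls; the list is an illustrative subset, not a complete inventory of the buffering layers discussed in the conversation.

```python
#!/usr/bin/env python3
"""Snapshot a few kernel buffering knobs on a Linux host.

This only covers the kernel/socket layer; NIC ring buffers (ethtool -g)
and the L7 proxies (haproxy/varnish/ats) carry their own buffer settings.
"""
from pathlib import Path

# Illustrative subset of sysctls relevant to socket buffering.
SYSCTLS = [
    "net/core/rmem_max",
    "net/core/wmem_max",
    "net/core/netdev_max_backlog",
    "net/ipv4/tcp_rmem",   # min / default / max receive buffer (bytes)
    "net/ipv4/tcp_wmem",   # min / default / max send buffer (bytes)
]


def snapshot() -> dict[str, str]:
    """Read each sysctl from /proc/sys, tolerating missing entries."""
    out = {}
    for name in SYSCTLS:
        path = Path("/proc/sys") / name
        try:
            out[name.replace("/", ".")] = path.read_text().strip()
        except OSError as exc:  # not Linux, or sysctl not present
            out[name.replace("/", ".")] = f"unavailable ({exc})"
    return out


if __name__ == "__main__":
    for key, value in snapshot().items():
        print(f"{key:30s} {value}")
```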
[17:26:04] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus)
[18:24:47] 10netops, 10Infrastructure-Foundations: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 (10cmooney) p:05Triage→03Medium
[18:25:04] 10Traffic, 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) PT was pretty smooth, ES likely to be later today, closer to when their daily traffic cycle begins to trend downwards.
[18:26:04] bblack: congrats!
[18:32:34] 10netops, 10Infrastructure-Foundations, 10Patch-For-Review: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 (10cmooney) The diff in the above patch is relatively straightforward on the CRs, basically the same as you can see looking at the policy changes:...
[18:36:02] 10netops, 10Infrastructure-Foundations, 10Patch-For-Review: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 (10cmooney) In drmrs the changes are more involved, and it's a little harder to read the homer diff. Ultimately they are only small and shouldn't...
[18:46:53] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr)
[18:47:26] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr)
[18:48:41] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) @cmooney i have connected spine switches to scs and updated netbox
[18:49:06] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus) Hmm, the 1.21.1 build didn't work out of the box. Running `build-envoy-deb buster future` got me this: ` [...] ./ci/run_envoy_docker.sh ./ci/do_ci.sh b...
[19:20:40] 10netops, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Unify loopback filters between CR routers and L3 switches - https://phabricator.wikimedia.org/T304553 (10cmooney) To clarify the 'port' isn't an option on QFX even for UDP, although it allows you to define a term with that. So I've changed...
[19:41:32] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @jclark-ctr super thanks for that! I'll open a task and start planning how we take care of the move.
[20:01:29] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, 10ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10cmooney) @Jclark-ctr I'm not getting any output on port 20 or 29 of the scs-f8. Are the two Junipers powered on? If not can you double c...
[23:40:38] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10RLazarus)
[23:40:48] 10Traffic, 10SRE, 10envoy, 10serviceops, 10Patch-For-Review: Refactor envoy HTTP protocol options to new version - https://phabricator.wikimedia.org/T303230 (10RLazarus) 05Stalled→03In progress p:05Low→03Medium