[07:50:51] hello folks, final clean up for the kafka logging clusters (need a restart of kafka on all brokers)
[07:57:22] ack
[10:37:27] kafka-logging eqiad all restarted, nothing noticeable
[10:37:44] I'll do codfw either in the afternoon or tomorrow morning (via cookbook since all worked fine in eqiad)
[10:38:02] Great stuff. Thanks elukey.
[10:38:08] <3
[10:38:35] <3
[10:40:31] <_joe_> moritzm: I'm having yet again a problem with debdeploy
[10:40:49] <_joe_> it says "already up to date" for nodes that are not up to date
[10:41:13] <_joe_> I am wondering if the version of the php package is confusing debdeploy or what
[10:41:26] <_joe_> if I run apt-get upgrade on the servers, the package will be updated
[10:42:17] <_joe_> so I am going to do the installation "manually" because it also allows me to batch it as I like
[10:43:04] _joe_: can you leave one host with the current version to test? and possibly the debdeploy config file you used?
[10:43:15] mo.ritz is afk for a bit right now
[10:43:58] <_joe_> volans: I encountered the same issue last time
[10:44:09] <_joe_> and no, I can't leave a server un-upgraded for long
[10:44:18] <_joe_> because of this specific upgrade
[10:44:31] <_joe_> uhm well I can depool it
[10:44:39] <_joe_> or leave an mwdebug out
[10:45:48] _joe_: apt-get update has been run on the hosts, right?
[10:45:58] or at least 30m have passed since the addition to apt.w.o
[10:47:16] <_joe_> volans: 1 hour has passed
[10:47:20] <_joe_> more than, actually
[10:47:27] <_joe_> I did check all the obvious stuff :)
[10:47:43] a hostname that didn't get updated?
[10:48:00] <_joe_> the only one I left out is mwdebug2002
[10:48:16] <_joe_> I am doing it via cumin everywhere else
[10:48:32] _joe_: any hostname that debdeploy reported as not updated, to check logs
[10:48:36] no matter current status
[10:48:47] <_joe_> mwdebug2002, mw1415
[10:49:17] thx
[11:14:45] XioNoX: topranks: hello! I am trying to debug why the bfd session is down for the new dns4003 host, IP: 198.35.26.7
[11:15:15] things I have ruled out: bird config (at least as to how it compares to, say, dns4002), anycast-hc issues (prefixes correctly advertised)
[11:15:45] running
[11:15:45] sukhe@cr3-ulsfo> show bfd session address 198.35.26.7 extensive
[11:15:58] shows it to be down, which is expected, but
[11:15:59] > Session type: Single hop BFD
[11:16:06] this is not correct?
[11:16:55] anyway, I thought I should run it by you first in case I am missing something obvious, and then if not, I can try to go deeper (to the extent I can debug!)
[11:18:08] I did try clear bfd session address, to no effect
[11:19:18] * topranks looking
[11:19:40] output from birdc show bfd sessions
[11:19:40] IP address     Interface  State  Since         Interval  Timeout
[11:19:43] 198.35.26.193  ---        Init   21:03:22.634  2.000     6.000
[11:19:46] 198.35.26.192  ---        Init   00:44:01.206  2.000     6.000
[11:19:49] which means something is broken, but that's not surprising :D
[11:20:15] Hehe
[11:20:50] I wonder could it be the direct/multihop issue
[11:21:24] Can’t recall where that popped up before, but BFD uses a different port for “multihop” versus direct (for reasons I won’t get into here)
[11:21:33] I tried looking into that
[11:21:43] And if either side is confused about which is in operation, it can go down
[11:21:44] Ok
[11:21:57] let me get the output
[11:22:04] but then of course, my understanding of that is limited so
[11:24:11] eh, JUNOS does not save the command history
[11:24:13] now I forgot what I ran :P
[11:24:23] sorry - only catching up on your above messages
[11:24:33] np, and this is not urgent
[11:24:35] so please take your time
[11:24:48] just sharing it here before it gets late for you and I get busy with breakfast and other family stuff
[11:24:58] cool, but yes I think you are right looking at the single/multihop angle
[11:25:11] :)
[11:25:58] ok I ran:
[11:26:02] show protocols bgp group Anycast4
[11:26:11] but that of course looks OK, also because dns4002 (the working host) shows:
[11:26:30] 198.35.26.8 Up 0.900 0.300 3
[11:26:59] sukhe@cr3-ulsfo> show bfd session address 198.35.26.8 extensive
[11:27:02] Session type: Multi hop BFD
[11:27:26] so something is definitely specific to the new host and not our config (again, not surprising, but an additional confirmation)
[11:28:19] hmm so yeah it's the single/multi-hop
[11:28:25] I added this to the config:
[11:28:28] cmooney@cr3-ulsfo# set protocols bgp group Anycast4 bfd-liveness-detection session-mode multihop
[11:28:39] forcing it to use multi-hop, and now it has come up
[11:28:42] wow!
[11:28:43] magic!
[11:28:43] hah
[11:28:46] interesting
[11:28:51] mmh, I can't reproduce; on mwdebug2002, running "debdeploy-deploy --source php7.4 --updatespec debian_10_1:7.4.30-3+0~20220627.69+debian10~1.gbpf2b381+wmf1+buster4" (which is what the server calls on the client) updated the packages to buster4, looking some more
[11:28:54] so the key being: bfd-liveness-detection
[11:29:18] but that's not in the automation config, I'll need to discuss with XioNoX about whether we should add this
[11:29:36] Without explicitly setting the "session-mode" I'm unsure what logic JunOS uses to decide if a session should be single or multi-hop
[11:29:36] I think this is also in one of the tickets IIRC? it was last night so I don't remember
[11:29:43] interesting
[11:29:56] And it seems to be non-deterministic
[11:30:18] Given that with the same config, dns4002 was up and in multi-hop mode
[11:30:23] cool! thanks for the help topranks! <3
[11:30:29] yep, so it definitely was this
[11:30:42] np!
[11:30:54] and yeah I think it makes sense for this to be in the automation config, but I will leave that to you and Arzhel :)
[11:30:59] leave it with us, it should be relatively easy to fix one way or another
[11:31:01] for me at least, this is the first time I ran into this
[11:31:41] the complicating factor is that on the L3 switches (i.e. eqiad rows E/F, drmrs), the session is going to be single-hop I think. But we can work around that.
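For reference, the one-line fix quoted above corresponds to this stanza in the rendered Junos configuration (a sketch derived only from that set command; the rest of the Anycast4 group definition is omitted):

    protocols {
        bgp {
            group Anycast4 {
                bfd-liveness-detection {
                    session-mode multihop;
                }
            }
        }
    }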
[11:32:55] Also FYI, Arzhel is going to be looking at our setup for BGP to hosts over the next while, with a view to streamlining/simplifying config for other teams if possible.
[11:33:11] so a "hacky fix" might be the way to go for now, pending that review
[11:34:35] yes, for drmrs it is single-hop, but I wasn't expecting this :)
[11:34:40] np, I think this works for now
[11:34:49] I believe the source of this difference is that BFD is implemented in hardware in many NICs/ASICs, for very fast operation offloaded from the device CPU. And that could consume all traffic to the single-hop port.
[11:34:59] we might reimage another dns host but I am happy to do it manually (running the live command above)
[11:35:01] So multi-hop, which should flow through a router, was given a different port to avoid that
[11:35:06] I will add it to the anycast wikitech page as well
[11:35:48] the command above is configuration (not just a one-off command), so we will probably integrate it into the automation templates
[11:36:06] the re-image might just work also, given I don't know right now what makes Junos decide it should be single versus multi
[11:43:29] thanks!
[12:27:36] yeah, afaik BGP will establish with BFD down, and work fine.
[12:27:45] It's only if BFD is up, and transitions to down, that it will take BGP out.
[12:30:29] yep
[12:47:20] SREs, in 15min we're going to start the re-cabling of eqiad row C; no impact expected, but 1/ let me know if you see something funky, and 2/ please refrain from doing any impactful/risky change in the next 1 to 2h
[12:48:18] XioNoX: ack, thanks for the heads up
[12:49:37] XioNoX: Thanks. Impactful/risky just in row C, or anywhere in eqiad? I have some decoms to run in other rows, but I can wait if that's preferable.
[12:50:44] btullis: decoms are fine. Risky changes anywhere, to avoid confusion and having to troubleshoot two outages at the same time :)
[12:51:13] Ack, thanks.
[13:04:13] XioNoX: at least in this case, the versions of bird, etc. are the same (same role, etc.) so no differences there
[13:04:25] if there are differences on the host related to other configs, not sure
[13:04:37] we will be doing another reimage shortly so I will follow up
[13:05:05] sukhe: cool, maybe something to try is to bounce BFD on one side and/or the other, to see if it's an ordering issue
[13:05:16] but TIL that the BGP session can be established with BFD down
[13:05:25] XioNoX: ok!
[13:05:29] yeah, it's "smart"
[13:19:06] Alright, we're starting the row C maintenance, chat will happen in the -dcops channel
[13:20:40] volans: topranks: XioNoX: sorry for the confusion with the NEL dashboard earlier, looks like I recently saved it with bad settings by accident (tcp.timed_out, one of the best signals, being filtered out)
[13:20:59] yep, we eventually found that out :)
[13:21:01] no prob
[13:21:04] now fixed
[13:21:09] <3
[13:21:35] also found another dashboard saved without drmrs in trafficland
[13:24:35] cdanis: no problem! NELs still proved very useful :)
[13:25:14] volans: I suspect a host re-image (on a new IP) is failing due to a cached DNS record locally on install1003.wikimedia.org; it's stalling at the partman stage
[13:25:22] does that make sense or am I talking nonsense?
[13:25:50] (I do know the old IP is in the local dns cache, but unsure if that would cause the reimage problem)
[13:25:58] topranks: check the automation dhcp file
[13:26:10] in /etc/dhcp/automation/...
[13:26:20] just tree there, there are very few files
[13:26:54] yep, that has the correct (new) IP in the DHCP snippet, i.e. /ttyS1-115200/cloudnet1006.conf
[13:27:03] and DHCP works ok
[13:27:06] if that's what you mean?
[13:27:24] I meant to check that the config had the correct value
[13:27:35] I might be missing what is actually failing
[13:27:48] we can wipe-cache the recursors if needed for that hostname
[13:28:34] https://phabricator.wikimedia.org/P35372
[13:28:43] yep, think that might be what's needed
[13:28:58] I'll have a look after the meeting, might have happened naturally by then
[13:31:39] topranks: we have a cookbook for that if that's what you need
[13:42:38] kafka-logging codfw fully restarted as well
[13:45:18] Bad news today: according to gfwatch.org test results, *.wikipedia.com has been blocked in China since Sept 30
[13:46:02] good thing we're wikipedia.org
[13:47:17] diskdance[m]: wikipedia.org has been blocked for quite some time, on and off, since April 2019: https://ooni.org/post/2019-china-wikipedia-blocking/ (zh.wikipedia being blocked since 2015)
[13:48:31] sukhe: Obviously I am aware of that. 😅 I meant specifically wikipedia.com in my message above
[13:49:04] So you can see, the GFW is continuously blocking Wikimedia domains
[13:49:07] yes, but that redirects to wikipedia.org anyway, and so I wasn't sure how the .com blocking affects us
[13:50:09] yeah, maybe the .com thing is related to https://github.com/net4people/bbs/issues/128
[13:50:18] > The Great Firewall of China has blocked google.com and all its subdomains
[13:51:41] From my testing and posts on that page, that block has been recalled
[13:52:00] Which... people have no idea why. That is a black box to us
[13:52:40] So what I want to express is that we actually should take some practical actions to address this issue
[13:53:35] <_joe_> can this discussion be moved somewhere where it is more on topic? Also - I suggest - probably a publicly logged channel isn't the best place for such discussions either
[13:55:33] _joe_: I would consider this on-topic because whatever technical decision is made in the end must be made by SRE members
[13:56:12] And as for the public logging issue... As Reedy said before, it's only a matter of time for the censors to know we are doing something
[13:56:44] yeah, but censorship issues are tricky and sensitive
[13:56:59] we have teams that work on this a lot, and we rarely discuss it openly in public
[13:58:04] I mean, there's lots of things we do say publicly, especially when we take [technical] action, it's not like it's all in the dark
[13:58:21] <_joe_> diskdance[m]: I said "more on topic", but it seems clear to me you're not open to suggestions, what can I say ;)
[13:58:57] but it's hard for a conversation about e.g. GFW current+future policies and our reaction to them, to get very far in the open without treading into uncomfortable territory
[14:01:33] _joe_: I'm sorry if I let you feel that I am not open to suggestions, actually any suggestions are ok
[14:02:10] Maybe I am getting a bit nervous about the fact that the GFW is blocking more and more Wikimedia domains
[14:05:39] bblack: thank you for your response! I created a task about changing DNS resolution in China from 198.35.26.96 (which is blackholed) to something else several months ago; can I ask about any progress on this? Thanks
[14:07:52] diskdance[m]: we have no plans to do that. While input is always welcome, our actions are based on our own decision-making process (to put it bluntly: someone filing a ticket asking us to do X does not cause us to do X). It's [switching that IP] not a viable answer to any GFW problem, it's just a pointless step which would be quickly countered, in an escalation we couldn't sustain meaningfully.
[15:43:14] XioNoX: something broke in netflow I think?
[15:43:25] https://w.wiki/5nMU
[15:43:39] name is broken but numeric works
[15:43:53] https://w.wiki/5nMW bblack vgutierrez jhathaway
[15:44:10] cdanis: thanks
[15:44:14] yeah, I was looking into it
[15:44:46] source is upload-lb.eqiad.wikimedia.org, unsurprisingly
[15:44:55] lvs1018 is on fire yeah
[15:45:03] and AWS of course
[15:45:08] https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&viewPanel=5&from=now-2d&to=now
[15:45:20] cdanis: it's not broken, that additional data comes with later jobs, names are not real time
[15:45:26] XioNoX: ohhhh ok
[15:45:34] volans: ^
[15:45:42] something on AWS
[15:46:18] about 8/10 IPs
[15:46:59] https://w.wiki/5nMc
[15:47:33] "user_agent":"datasets/2.2.2; python/3.10.1; pyarrow/8.0.0; torch/1.12.0; tensorflow/2.10.0"
[15:47:39] yep
[15:47:45] seems like deja vu
[15:47:47] is this the same as the last eqiad issue, image fetching for commons?
[15:47:51] yeah
[15:48:20] this is most likely the Stable Diffusion dataset (LAION)
[15:48:20] I think _joe_ had a patch ready to merge, but we didn't want to do it without someone from traffic reviewing it
[15:48:41] https://docs.google.com/document/d/1ECblDcC0g6TiEFEvka_dRuawbqKgIcANu8TmVRIxFFY/edit relevant doc
[15:48:56] * vgutierrez looking
[15:49:01] <_joe_> yes, there's a patch from me
[15:49:09] https://gerrit.wikimedia.org/r/c/operations/puppet/+/832268 (adding a few hooks to VCL)
[15:49:09] https://gerrit.wikimedia.org/r/836093 (dependent on the above, adds a rough ratelimit)
[15:49:31] the first one got merged
[15:50:33] I'd go with "^datasets" as the matching regex for the UA
[15:51:02] going back to my maintenance cleanup, ping me if needed
[15:52:42] any action items other than poking folks to get that patch reviewed?
[16:06:54] jhathaway: the patch itself seems sane, but we need to check how much it's gonna hurt varnish to go through the vsthrottle code for every cache hit
[16:07:50] vgutierrez: makes sense, thanks
[16:11:07] <_joe_> vgutierrez: that is exactly why I didn't merge it
[16:12:16] is anyone actively working on that?
[16:16:29] cdanis: could we merge it now that upload@eqiad got back to normal, and measure the internal varnish timers?
[16:16:43] we're scaling this one to throttle reqrate to avoid lvs saturation, not outbound bytes, right?
[16:18:20] my concerns with this approach are mostly about the bytes side of things
[16:19:07] but as a solution for the inbound lvs side, probably our best bet
[16:25:50] bblack: you mean instead of a per-ip throttle?
[16:27:12] yeah, sorry, my mid-all-staff ramblings seem incoherent on further reflection, let me try again:
[16:29:17] For the specific problem I think we're looking at here, which I think is that the reqrate of this datasets UA is saturating LVS on the inbound side, https://gerrit.wikimedia.org/r/c/operations/puppet/+/836093/1/modules/varnish/templates/upload-frontend.inc.vcl.erb seems like a good idea
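As a rough illustration of the kind of ratelimit being discussed (a hedged sketch, not the contents of the actual Gerrit patch: the thresholds and the X-Client-IP header are invented here, and vmod_vsthrottle's is_denied() is assumed per its upstream docs):

    import vsthrottle;

    sub vcl_recv {
        # Illustrative numbers only: allow the bulk-downloader UA at most
        # 500 requests per 10s window per client IP, send 429 beyond that.
        if (req.http.User-Agent ~ "^datasets" &&
            vsthrottle.is_denied("datasets:" + req.http.X-Client-IP, 500, 10s)) {
            return (synth(429, "Too Many Requests"));
        }
    }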
[16:30:40] for other related upload problems, where we hit outbound bytes saturation first, we might want to try such a req-ratelimit if it's the only tool we've got, as a temporary mitigation. But I think the better long-term answers lie in limiting the actual outbound bytes, not the incoming reqs with some guesswork about the outbound size per req, etc.
[16:32:00] to do it in the VCL world, probably the closest thing we could do easily would be to weight the limiter by response size.
[16:33:07] but we could also go back and look at whether we could do better at lower levels too, with e.g. outbound shaping+fairness, both per-flow and for the whole NIC, to mitigate some of the worst spike impacts?
[16:33:59] of course if we're not saturating cp NICs and not being unfair to flows, then the next problem is how we shape traffic in the aggregate to not saturate transits/peers, which seems more difficult
[16:34:11] the most recent page was outbound link saturation, not LVS saturation afaik
[16:34:36] yep.. the impact came from outbound link saturation
[16:34:41] 15:44 < cdanis> source is upload-lb.eqiad.wikimedia.org unsurprisingly
[16:34:41] 15:44 < vgutierrez> lvs1018 is on fire yeah
[16:34:45] because of requests from that "datasets" UA (from AWS to upload)
[16:34:48] ^ I was basing my assumption on those comments ^
[16:35:08] "on fire" as in handling way more incoming traffic than usual
[16:35:16] yeah, we didn't saturate LVS *this time*
[16:35:21] so yeah, if it's not a reqrate problem but a bytes problem... reqrate can be a very rough tool to get a handle on things, but it's not really the "right" thing.
[16:35:31] we have once recently, in an image hotlinking event where a thumb was hotlinked in a mobile push notification
[16:35:55] so the thumb object wound up being the 'right' size where LVS saturated instead of any links (although multiple peering links got up around 70-80%)
[16:36:46] the fairness problems and the disconnect between cp NIC outbound rates vs transit/peer capacity, those are hard problems to solve
[16:36:59] so, I think the reqrate limit is probably the right thing to do now, given that this exact scenario with the `datasets/` U-A has played out on AWS multiple times this week
[16:37:05] yeah
[16:37:17] in the medium term I think it won't be too hard to do some bytes-based limiting in haproxy, at both the per-IP and per-AS level
[16:37:22] it can track all that cheaply
[16:37:33] we will have to teach it about ASNs but that won't be hard
[16:37:51] yeah, we still want the analytics though
[16:38:04] we've really gotta move analytics up a daemon before we put too much more black holing up in haproxy
[16:38:09] getting the limits right is still tricky, but still, saying that "one single ASN shouldn't use more than 10% of outbound line rate" seems fair and reasonable, and will help in a majority of cases
[16:38:22] bblack: well, that's why I proposed haproxy telling Varnish to reject the traffic, with that special header
[16:38:30] but I don't think anyone has implemented that idea yet
[16:39:08] even such a simple assertion, I'd challenge, will sometimes be wrong
[16:39:14] (the 10% statement)
[16:39:28] I think that's okay as long as you have analytics you're paying attention to :)
[16:40:02] but in the meta, we keep putting piles of duct tape in place instead of solving foundational problems in a more universal way, too.
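A sketch of the per-IP half of that haproxy idea, for illustration (the directives are standard haproxy 2.x config, but the frontend name, window, and threshold are invented here; the per-AS variant would additionally need the ASN mapping mentioned above):

    frontend fe_upload
        bind :443 ssl crt /etc/haproxy/placeholder.pem
        # Track per-client-IP egress bytes over a 1-minute sliding window.
        stick-table type ip size 1m expire 10m store bytes_out_rate(1m)
        http-request track-sc0 src
        # Deny (from the next request on) any single IP pulling more than
        # ~1Gbit/s of response bytes; the threshold is purely illustrative.
        http-request deny deny_status 429 if { sc0_bytes_out_rate gt 7500000000 }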
[16:40:39] <_joe_> uhm
[16:41:39] I'm not saying we have much choice in the short-to-medium term, but it's a constant worry in the long term
[16:44:45] stupid question: is there any high-performance, open source L4-to-L7 distributed firewall solution to shape traffic?
[16:45:19] the kernel has some built-in facilities at L2-4, for outbound shaping and fairness
[16:45:31] we've puppetized some work in that area before, but never gone very far with it
[16:45:37] <_joe_> bblack: jynus was thinking WAF I think
[16:45:57] <_joe_> jynus: it's the secret sauce people pay big bucks to the CDNs for
[16:46:03] yes, not the backend but the management side of it
[16:46:20] I don't know what the industry name for it was
[16:46:24] <_joe_> the answer is "not that I know of"
[16:46:55] doing anything centralized (for a single DC's scope) on the outbound side is hard, at least anything that would actually pass all traffic through it.
[16:47:01] basically to move all the non-functional side of it outside of VCL, for performance reasons
[16:47:39] but you could imagine a reactive out-of-band system that could do something to mitigate
[16:47:55] <_joe_> bblack: yes, that is kind of a potential endgame for requestctl
[16:47:59] (one that tracks traffic stats and makes decisions to deploy blocking rules on juniper and/or lvs and/or cp)
[16:48:05] or maybe at several layers depending on the needs (misses vs frontend traffic, tls termination, etc.)
[16:48:18] <_joe_> have some ML model sweep through the real-time traffic and extract abuse rules to apply
[16:49:05] am I crazy for thinking that needs a dedicated service?
[16:49:21] for the external image saturation case, we could also invert the ratelimits to be about the URLs themselves
[16:50:08] the "mobile app causes millions of hits on one image" case, I mean. If we assume referer isn't a reliable enough signal, we could do a vsthrottle that's keyed on the image name, basically.
[16:50:28] it's more or less "no one image should be pouring out of our systems at more than X/sec rate"
[16:51:18] <_joe_> bblack: yes. cdanis has done some research into that btw
[16:52:08] there are some scaling problems with just doing the vsthrottle keys against all transient upload URIs
[16:52:23] but the ML ones will only request one image once
[16:52:28] but we could filter it so that vsthrottle doesn't even kick in until the hit count on a response gets beyond some limit number
[16:52:38] jynus: yeah, different problem
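A hedged sketch of that URL-keyed variant, with the hit-count gating suggested above so that the vast majority of responses never touch vsthrottle (thresholds invented; assumes a Varnish version where vcl_deliver may return synth):

    sub vcl_deliver {
        # Illustrative: only objects that are already demonstrably hot pay
        # the vsthrottle cost, and no single URL may be served out faster
        # than 1000 hits per second.
        if (obj.hits > 10000 &&
            vsthrottle.is_denied("url:" + req.url, 1000, 1s)) {
            return (synth(429, "Too Many Requests"));
        }
    }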
[16:54:01] we have a lot of tooling vs problem-space mismatches :/
[16:54:36] (tooling at the lowest level I mean, e.g. vsthrottle as a universal solution)
[16:56:40] bblack: yeah, that's one of the reasons why I was proposing haproxy, it's quite a bit more flexible in what it can track
[16:57:17] yeah, it makes sense in general for that reason, and it's the frontmost layer in terms of defending the rest of the stack
[16:57:19] re: the research _joe_ mentioned, "research" is a bit generous, but there is some anecdata comparison at https://phabricator.wikimedia.org/F35546836
[16:58:40] https://gerrit.wikimedia.org/r/c/operations/puppet/+/768723/31#message-e727b6f1ba658c0d9623084bda2ddf18ba21b02f
[16:59:06] very long morning and I need to step away for a bit, back in about an hour
[17:03:56] back on the kernel stuff: in modules/cacheproxy/manifests/performance.pp we currently define (for the cp nodes)
[17:04:02] qdisc => 'fq flow_limit 300 buckets 8192 maxrate 256mbit',
[17:04:41] which the interface-rps script applies at the per-queue level for our multiqueue cards, so for example cp1075 in "tc qdisc" will show 12 queues, each running that set of fq params.
[17:05:02] very little has been done in this area, those parameters are very rough guesses
[17:05:29] there are other qdiscs we could use, other fairness policies, the bucketing could be wildly wrong, there's no overall shape limit, etc.
[17:07:44] this is basically attacking one edge of the problem, from down there: how do you ensure all the various tcp flows coming out of these machines are fair to each other and the network (which targets the "few IPs downloading lots of stuff" problem more than the hotlinking/mobile-app sort of thing)
[17:08:50] we could even shape the whole 10G card down to a reasonable traffic level
[17:09:16] (given the total outbound capacity of all the cp nodes at a site is larger than the total possible transit+peering out)
[17:10:26] by doing that you're still "inducing failure" by dropping packets, but by doing it proactively with an aim towards per-flow fairness, more of the good stuff is getting through and less of the bad stuff.
[17:12:13] (unfortunately, tc can't see L7, so we can't fix a widely-hotlinked/embedded image this way :/)
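For reference, the puppetized qdisc line above expands to per-queue tc commands roughly like these (a sketch: the device name and the shell loop are illustrative; in production the interface-rps script applies the settings):

    # The mq qdisc exposes one class per hardware tx queue; attach fq with
    # the puppetized parameters under each (12 queues on cp1075 per above).
    tc qdisc replace dev eth0 root handle 1: mq
    for q in $(seq 1 12); do
        # tc class minor IDs are hexadecimal, hence the printf.
        tc qdisc replace dev eth0 parent 1:$(printf '%x' "$q") \
            fq flow_limit 300 buckets 8192 maxrate 256mbit
    done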
[20:35:39] bblack: late to this, but I would be interested to discuss. We are looking to introduce some QoS/categorisation at the network level. So for instance, instead of dropping traffic below the 10G, it could instead be marked on the host as drop-eligible, and the network could be left to decide if it ultimately needs to be dropped upstream. Lots of moving parts here, and none of it is a cure for insufficient bandwidth. But I'd be interested to get your thoughts.
[20:41:34] topranks: that does give us some more options, cool!
[20:43:02] at some point I also want to pick yours/XioNoX's brains about egress traffic management options too (mostly out of curiosity, since it seems hard to get right at our scale with anything floss)
[20:43:56] cdanis: yep, absolutely.
[20:44:11] by egress traffic management you mean right out at the edge, towards the internet?
[20:45:39] this stuff isn't easy, but I think we could pull off a better integration of host-level and network-level control than many manage to do
[20:54:25] topranks: yeah, either by knowing to limit traffic at the host level, or by knowing to send one AS's egress split between peering and transit
[20:59:24] The last one is the real trick. Some of the SDN / SD-WAN stuff is built to tackle that, and also feed back performance info to the path selection. Very difficult to achieve at our scale, and for so many different destinations, though.
[21:00:38] On a manual level, for AS X, you can do BGP things to make it all seem equal and do ECMP. But weighting between one and the other, and adapting dynamically, is where it really gets hard.
[22:21:28] yeah, it seems quite hard to do at all, and I'm sure I don't even realize the half of it :)