[04:00:44] Karina
[08:06:31] XioNoX: o/ around?
[08:06:44] I am wondering if there is anything ongoing between eqiad and eqsin
[08:07:09] cc: topranks
[08:08:11] the purged instances on cp50xx nodes are showing some issues while connecting to kafka brokers in eqiad
[08:08:24] I see some packet loss in pings between cp5017 -> kafka-main1001
[08:09:46] elukey: I'm going to be on my laptop in 2h but you're right: https://grafana.wikimedia.org/d/m1LYjVjnz/network-icmp-probes?orgId=1&var-site=eqiad&var-target_site=eqsin&var-role=cr&var-family=All&from=now-3h&to=now
[08:10:08] TIL about the dashboard, thanks!
[08:10:21] elukey: you can drain the arelion link if you want
[08:11:06] XioNoX: never done it before, don't want to cause an outage on monday morning :D
[08:11:26] I can wait for you or Cathal, no problem
[08:11:51] https://netbox.wikimedia.org/circuits/circuits/27/ state->drained then run homer on both sides' routers
[08:12:04] elukey: I’m here I can have a look
[08:12:28] could be worth an email to Arelion too
[08:12:34] XioNoX: ah wow super simple
[08:12:50] topranks: good morning :)
[08:13:27] indeed :)
[08:13:56] elukey: it is indeed simple so feel free if you want to run. I best take a look anyway though so no probs either way
[08:14:29] topranks: I can try yes, before that one qs - where can I find that the issue is related to Arelion?
[08:14:32] (as curiosity)
[08:15:26] in the meantime - setting https://netbox.wikimedia.org/circuits/circuits/27/ to state drained
[08:16:09] various ways depending, the graph XioNoX linked is our ICMP probes, usually a good sign of some kind of “partial” issue (i.e. not down co
[08:16:19] *completely
[08:16:38] yeah previous experience also, not the first time it happens
[08:16:55] and knowing that the arelion link is the primary one
[08:17:04] ah wait I am slow today, if I see problems with pings eqiad <-> eqsin then it is the transport link that causes problems, hence I check what it is
[08:17:36] yeah correct
[08:17:48] eqiad-eqsin goes through codfw
[08:18:18] I was about to say - homer on cr3-eqsin and cr1-codfw
[08:18:26] does it sound good?
[08:18:30] yeah that's correct
[08:18:38] both direct sides of the circuit
[08:19:15] you can click the "trace" button
[08:19:20] https://netbox.wikimedia.org/circuits/circuit-terminations/69/trace/
[08:19:21] yep, sry that's the missing piece you need to know what sites it links
[08:20:01] wow totally different since the last time I worked on it
[08:20:01] one day I'll write a cookbook
[08:20:11] I actually started locally
[08:20:11] feels like being in the Minority Report movie
[08:20:17] (compared to before)
[08:20:30] good job folks
[08:20:50] thanks :)
[08:21:15] Also this circuit is more susceptible to these kind of issues than many of our others
[08:21:39] It's a virtual-ethernet circuit over MPLS, so if the carrier loses part of the normal path it will re-route
[08:21:53] and those kind of events can mean the carrier suffers congestion, we see this kind of thing
[08:22:08] *most* of our circuits are unprotected wavelengths, meaning if part of the overall path is cut it just dies
[08:22:12] doesn't re-route automatically
[08:23:54] thanks :)
[08:24:04] so homer shows me this diff for cr1-codfw
[08:24:04] - metric 2000;
[08:24:05] + metric 5000;
[08:24:17] interface xe-1/0/1
[08:25:29] that seems correct from what I can see
[08:25:38] yep that makes sense
[08:25:42] shall I proceed codfw then eqsin?
[08:25:43] just adjusts the OSPF cost
[08:25:46] please do
[08:25:46] (with the commits)
[08:25:47] ack
[08:25:48] thanks :)
[08:26:51] thank you folks! Learning a lot
[08:28:26] {{done}}
[08:29:46] ok I see zero packet loss now in pings between cp5017 and kafka-main1001
[08:29:53] Nice!
[08:30:08] and purged metrics are recovering afaics
[08:30:20] Yes the traffic should be going from Eqiad to Ulsfo and across the Pacific with another carrier now :)
[08:32:15] purged is happy, all recovered afaics
[08:34:12] great. I opened T337220 and will chase up with Arelion
[08:34:45] On past experience it's probably some fibre cut that's pushed traffic between the sites over a new path which is congested
[08:38:36] makes sense yes
[09:38:07] hi all back from vacation. just making my way through my backlog but if there is anything you would like me to push to the top feel free to ping
[10:14:34] XioNoX: can you merge https://gerrit.wikimedia.org/r/c/operations/dns/+/914751 I'm at the airport with a bit of a spotty internet connection. Should be very safe to merge. I agreed with volans|off on getting it merged this monday
[10:23:09] arturo: let me have a look, I think Arzhel has spotty connectivity too
[10:24:04] seems ok, I'll take care of it
[10:26:06] elukey: FYI it seems the drops on the eqsin link may have been usage related on our side
[10:26:28] a little bit "boy who cried wolf" probably jumped too quick to assume carrier issue because we've had that a bunch of times on that circuit
[10:26:51] I've un-drained it now, but we still have high-ish usage. I'm keeping an eye and will flip back if it seems needed
[10:27:00] Do let me know if you observe any app-layer problems
[10:29:32] thanks topranks !!
[10:31:01] ack!
[10:41:14] I had a quick look through https://turnilo.wikimedia.org/#webrequest_sampled_live/ but don't see any smoking gun on why we have those spikes
[10:43:43] but there is indeed correlation, even though we're supposed to have a 4G CAP, so maybe it's still (or back at) 2.5G?
[10:44:15] might be worth asking Arelion
[10:47:02] XioNoX: yeah I asked them already
[10:47:23] the 5-min averaging may be smoothing out bigger spikes, but yep doesn't look like it's saturated all the time
[10:47:45] right now usage seems ok and no drops, I also was looking in turnilo but didn't turn up anything yet
[10:48:24] Hopefully the carrier can confirm their policer is set to the right level, and if they have counter drops on it
[13:30:22] topranks: just noticed that around an hour ago there was another round of issues for purged in eqsin, auto-resolved
[13:32:08] elukey: yeah, I was on lunch but looking I see there was another burst of traffic
[13:32:28] we should be able to get more throughput out of the circuit, carrier have confirmed settings are ok
[13:35:23] elukey: for now I think I'll drain it again, at least until we can verify what's causing the bursts in traffic, and see if we can do anything else our side to help better use this link to its full capacity
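
Note on the 10:47:23 remark that 5-min averaging may smooth out bigger spikes: the effect can be illustrated with a rough sketch. Only the 4 Gbps cap is taken from the conversation; the baseline, burst size, burst length, and the function names are hypothetical values chosen for illustration, not measurements from this circuit.

```python
# Hypothetical numbers: only the 4 Gbps cap comes from the discussion above.
CAP_GBPS = 4.0
WINDOW_SECONDS = 300  # one 5-minute averaging window


def build_window(baseline_gbps, burst_gbps, burst_seconds,
                 window_seconds=WINDOW_SECONDS):
    """Per-second offered load: one short burst followed by a steady baseline."""
    return [burst_gbps if second < burst_seconds else baseline_gbps
            for second in range(window_seconds)]


def summarize(samples, cap_gbps=CAP_GBPS):
    """Return (window average, peak, number of seconds above the cap)."""
    average = sum(samples) / len(samples)
    peak = max(samples)
    seconds_over_cap = sum(1 for s in samples if s > cap_gbps)
    return average, peak, seconds_over_cap


# Assumed scenario: 2 Gbps baseline with a single 20-second burst at 6 Gbps.
samples = build_window(baseline_gbps=2.0, burst_gbps=6.0, burst_seconds=20)
average, peak, seconds_over_cap = summarize(samples)
print(f"5-min average: {average:.2f} Gbps, peak: {peak:.1f} Gbps, "
      f"seconds over cap: {seconds_over_cap}")
```

In this assumed scenario the window average stays well under the cap even though the offered load exceeds it for 20 seconds, which is consistent with seeing drops on a link whose 5-minute graph never looks saturated.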