[05:09:09] <_joe_> cdanis: I thought quite a bit about composable patterns - they wouldn't simplify the resulting DSL but rather hide complexity from the operator so I preferred not to go there
[05:09:58] <_joe_> OTOH, I've been considering not forcing actions to use pre-determined patterns but to be able to define them in the expression directly.... but I'm slightly worried about that
[08:36:45] is there any ongoing work happening on kafka in eqiad?
[08:43:06] vgutierrez: any specific cluster? main-eqiad?
[08:43:47] actually... main-codfw
[08:43:57] we are seeing lag issues on purgeed@eqsin
[08:44:02] *purged
[08:44:18] and I can't pinpoint those to network issues on the transport links
[08:44:28] <_joe_> only eqsin?
[08:44:33] <_joe_> no ulsfo or codfw?
[08:44:49] <_joe_> magru connects to eqiad, right?
[08:45:32] I don't see anything in main-codfw's metrics atm
[08:45:43] _joe_: yeah, just eqsin
[08:45:48] and yes, magru connects to eqiad
[08:47:00] <_joe_> looks like you had one specific episode where lag increased linearly with time, which would indicate all consumers lost the ability to consume from kafka
[08:48:34] <_joe_> but looking at instance drilldowns, that looks like saturation (of what, remains to be seen)
[08:48:57] <_joe_> https://grafana.wikimedia.org/d/RvscY1CZk/purged-instance-drilldown?orgId=1&from=2025-07-10T07:20:54.059Z&to=2025-07-10T08:47:31.052Z&timezone=utc&var-site=eqsin&var-cluster=cache_text&var-instance=cp5017&viewPanel=panel-36
[08:49:16] <_joe_> that rtt to k-m2007 doesn't look good
[08:52:02] 2007?
[08:52:14] I see 2009 and 2010 in the tens of seconds
[08:54:57] <_joe_> sigh yes I confused colors
[08:56:43] <_joe_> in ulsfo, things look ok OTOH
[08:56:45] <_joe_> https://grafana.wikimedia.org/d/RvscY1CZk/purged-instance-drilldown?orgId=1&from=2025-07-10T07:20:54.059Z&to=2025-07-10T08:47:31.052Z&timezone=utc&var-site=ulsfo&var-cluster=cache_text&var-instance=cp4037&viewPanel=panel-36
[08:58:53] topranks, are you around?
[09:01:34] mtr looks healthy BTW https://www.irccloud.com/pastebin/ICkCEMun/
[09:05:25] so the issue with kafka-main2010 seems to be spread across the whole cluster
[09:05:40] I'm gonna restart purged on a single host
[09:05:45] the trace looks good yeah and latency isn't bad given the distance
[09:05:51] !log restarting purged on cp5017
[09:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:01] still super high
[09:08:41] it came back at 9s while the others are under 300ms
[09:08:54] and back alredy to 20s
[09:08:56] *already
[09:12:12] codfw and ulsfo are healthy so it's somehow related to network connectivity
[09:12:30] the 16 instances in eqsin seem to be showing the same kind of latency against kafka-main2010
[09:12:46] <_joe_> I would check how librdkafka measures that "average RTT" tbh
[09:19:38] _joe_: it looks pretty solid TBH, it saves the timestamp of each request sent to the broker and correlates the response by ID to measure the time
[09:21:58] not sure it's anything on the network tbh
[09:23:31] so the alternative is some kind of saturation in kafka-main2010?
[09:25:03] perhaps... unless you see some evidence of a network issue?
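
For anyone wanting to sanity-check that "average RTT" number later: librdkafka keeps per-broker round-trip stats (in microseconds) in the statistics JSON it emits when `statistics.interval.ms` is set. This is only a sketch; whether purged actually logs or exposes that JSON is an assumption, and the stats file path is illustrative.

    # assuming the client's statistics JSON has been captured to a file
    # (purged may not expose it by default); rtt values are in microseconds
    jq '.brokers[] | {name: .name, rtt_avg_us: .rtt.avg, rtt_p99_us: .rtt.p99}' /tmp/rdkafka-stats.json
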
[09:25:08] cmooney@cp5017:~$ time curl https://kafka-main2010.codfw.wmnet:9093
[09:25:08] curl: (52) Empty reply from server
[09:25:08] real 0m0.806s
[09:25:08] user 0m0.020s
[09:25:08] sys 0m0.000s
[09:25:37] obviously kafka being ok from elsewhere doesn't fit it being an issue with that
[09:25:50] <_joe_> is it possible all offsets of the consumer groups in eqsin are lagged somehow and that's causing strain?
[09:34:33] hmmm I'm getting pings over 1 second to kafka-main2010
[09:34:54] and same with cr1-codfw
[09:35:39] not that often... but those are happening in an open session of mtr
[09:36:15] I still see 200ms roughly
[09:36:18] https://www.irccloud.com/pastebin/CWKRzExG/
[09:36:26] and even at 1s that wouldn't explain these 20s stats
[09:37:10] nothing on the network has buffers big enough to cause a 20 second delay, only way we'd see that is if traffic was going around the planet 100 times
[09:39:13] outliers in ping response should be expected, and are likely delays on the systems at each end in sending/processing/responding
[09:44:01] https://www.irccloud.com/pastebin/5ABaQmiA/
[09:51:30] _joe_: lag for partitions 3 and 1 of the eqiad.resource-purge topic is quite high for eqsin instances
[09:52:31] https://www.irccloud.com/pastebin/fSAVPdCw/
[09:52:34] <_joe_> vgutierrez: we could decide to reset the offsets for those instances of purged. I don't remember how that's done but I'm sure there's something on google
[09:53:03] --reset-offsets --to-latest
[09:53:34] <_joe_> that will mean losing a few purges, but well
[09:53:47] <_joe_> I'd argue a purge lag has worse overall effects
[09:54:08] event lag is quite ok at the moment
[09:55:15] <_joe_> again mine is just a guess, I'm not sure that's what we're seeing
[09:55:42] <_joe_> also I'm going into a sequence of meetings now, so you'll need someone else to help with it
[09:55:55] <_joe_> elukey: WDYT, is my hypothesis way off?
[10:00:34] I can test with cp5017 and see what happens
[10:02:47] GitLab needs a short maintenance restart at 11:00 UTC
[10:04:52] !log resetting eqiad.resource-topic offsets for cp5017 consumer group
[10:04:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:54] hmm that had the opposite result :)
[10:10:27] ok.. we had a nice bump against the whole cluster :)
[10:18:49] _joe_ I was in a meeting sorry, lemme read the backlog
[10:25:15] vgutierrez: what is the kafka topic that purged reads from eqsin?
[10:25:34] both eqiad.resource-purge and codfw.resource-purge
[10:25:41] that bump for kafka2010 is weird, I am wondering if there has been a rebalance or something for those topics
[10:25:48] lemme check some metrics
[10:28:07] the topic is seeing big jumps in msg/s https://grafana.wikimedia.org/d/000000234/kafka-by-topic?from=now-6h&to=now&timezone=utc&orgId=1&var-datasource=000000005&var-kafka_cluster=main-codfw&var-kafka_broker=$__all&var-topic=eqiad.resource-purge
[10:28:35] but it happens regularly, at least for the past 2 days
[10:30:32] nothing extremely weird that I can see in kafka2010's logs
[10:33:34] tried to run an election to rebalance the partition leader
[10:33:37] *leaders
[10:36:13] <_joe_> RT says this election is rigged
[10:36:43] vgutierrez: sorry if you already discussed this, but afaics at 10 UTC all cp5xxx clients started to observe a high lag, and a reduction of traffic towards kafka
[10:37:31] do we have a way to understand if purged is processing the events correctly?
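
For reference, checking per-partition lag and doing the offset reset discussed above would look roughly like this with the stock Kafka tooling (a sketch: the consumer-group name is assumed to match the host, as the later !log entry suggests, and purged on that host has to be stopped first since the reset only works on an inactive group):

    # show per-partition offsets and lag for the group (group name "cp5017" is assumed)
    kafka-consumer-groups.sh --bootstrap-server kafka-main2006.codfw.wmnet:9092 \
        --describe --group cp5017
    # dry-run the reset to the latest offsets, then repeat with --execute to apply it
    kafka-consumer-groups.sh --bootstrap-server kafka-main2006.codfw.wmnet:9092 \
        --group cp5017 --topic eqiad.resource-purge \
        --reset-offsets --to-latest --dry-run
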
[10:38:41] the other thing that I can think of is to restart kafka on kafka2010, that seems to be the one showing the worst lag starting from 8 UTC
[10:39:05] yep.. the purged-generated requests
[10:39:23] so https://grafana.wikimedia.org/d/RvscY1CZk/purged-instance-drilldown?orgId=1&from=now-3h&to=now&timezone=utc&var-site=eqsin&var-cluster=cache_text&var-instance=cp5018&refresh=1m&viewPanel=panel-4
[10:39:40] and of course the number of kafka messages processed: https://grafana.wikimedia.org/d/RvscY1CZk/purged-instance-drilldown?orgId=1&from=now-3h&to=now&timezone=utc&var-site=eqsin&var-cluster=cache_text&var-instance=cp5018&refresh=1m&viewPanel=panel-32
[10:40:53] okk perfect
[10:42:03] I can't explain that kafka bytes sent drop though
[10:42:13] is it from rdkafka's perspective?
[10:45:23] yes
[10:47:48] this has been an issue just since this morning right? we have been doing a bit of work with mobileapps and changeprop which could have caused some churn in some topics over the last few days, but I couldn't explain it affecting a single DC
[10:50:35] yep yep exactly
[10:55:27] to me the weirdest graph is this one https://grafana.wikimedia.org/d/RvscY1CZk/purged-instance-drilldown?orgId=1&from=now-2d&to=now&timezone=utc&var-site=eqsin&var-cluster=cache_text&var-instance=cp5018&refresh=1m&viewPanel=panel-38
[10:57:08] that matches with the RTT increases
[10:57:17] but there is no sign that this is happening elsewhere
[10:59:29] hmm
[10:59:35] let me try capturing traffic there
[10:59:42] and see if I spot something weird
[11:04:42] GitLab maintenance done
[11:14:13] elukey: hmmm
[11:14:21] something is really off with rdkafka
[11:14:34] I've captured some 9093 traffic on cp5018
[11:15:10] and wireshark stats tell me that traffic for kafka-main2010 flows at 667kbps in one way and 19kbps in the other
[11:15:42] nothing as low as 7 bps like grafana suggests
[11:19:49] interesting
[11:20:21] I also checked haproxykafka metrics, it is a different kafka cluster target but I don't see weird data
[11:21:04] vgutierrez: other tests that come to mind could be to stop all purged and start them one at a time in a staggered way
[11:21:25] restarting purged on cp5017 didn't help
[11:22:04] yes yes this is why I am suggesting all, to see if somehow only a subset of them are ok vs all
[11:22:12] but it is a long shot, not based on any evidence
[11:22:24] so far it seems that the client is misbehaving
[11:22:29] for some reason
[11:22:39] so sudo cumin -b1 -s30 'systemctl restart purged.service'?
[11:23:48] more brutal, it would require downtime - systemctl stop purged on all, then your command
[11:23:58] with even a bit more sleep
[11:24:28] but even your restart could probably be a good test
[11:24:45] yeah.. let me go with that first
[11:24:50] yes let's do yours first, and see what/if anything changes
[11:24:51] +1
[11:24:59] !log rolling restart of purged in eqsin
[11:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:28:12] going to get a quick lunch, bbiab
[11:36:52] rolling restart finished, didn't help :]
[12:01:16] elukey: a flamegraph on cp5018 VS cp4037 doesn't suggest any weird pattern in purged..
[12:01:40] I'm gonna dump the stacktrace of the goroutines... but wtf :)
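
Two generic ways to grab those goroutine stacks from a Go daemon like purged, sketched under assumptions: whether the pprof handlers are reachable on the metrics port (2112, per the later alert) is not guaranteed, and SIGQUIT kills the process, so it's only an option on an already-broken instance.

    # if purged registers net/http/pprof on its metrics port (assumption - it may not):
    curl -s 'http://localhost:2112/debug/pprof/goroutine?debug=2' > /tmp/purged-goroutines.txt
    # 30s CPU profile, viewable as a flamegraph in the pprof web UI:
    go tool pprof -http=:8080 'http://localhost:2112/debug/pprof/profile?seconds=30'
    # last resort: the Go runtime dumps all goroutine stacks to stderr on SIGQUIT,
    # but this terminates the process
    sudo kill -QUIT "$(pidof purged)"
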
[12:09:06] <_joe_> restarting shouldn't reset the offset btw
[12:10:55] I'm aware of that
[12:11:27] top is the flamegraph for goroutines on cp4037, bottom for cp5018, same number of goroutines doing exactly the same thing https://usercontent.irccloud-cdn.com/file/8gUcWQOb/image.png
[12:11:40] I need a break to eat something
[12:36:53] is there a way to increase the logging level of what librdkafka does?
[12:37:15] because it may be really helpful to understand why the client behaves in that wa
[12:37:18] *way
[12:40:32] in theory `log_level: 7` in `purged-kafka.conf` should put it into debug mode
[12:40:46] do we want to try it?
[12:47:51] yes please let's do it on one!
[12:54:34] fabfur: current status? are you doing it?
[12:55:09] yes, disabling puppet on cp5017 and modifying the conf
[12:56:54] restarting purged on cp5017
[12:57:38] {{done}}
[12:58:27] some PROTOERR in the journal but probably unrelated?
[12:58:52] aside from that I don't see differences
[13:01:31] yup.. that's not relevant or we would have problems in every DC
[13:07:24] elukey: do you know if kafka can throttle clients for some reason?
[13:07:35] consumers in this case
[13:07:43] (in a meeting) but afaik nope
[13:09:59] I would re-enable puppet on cp5017 and restore the rdkafka conf, apparently it's useless for this case
[13:11:54] {{done}}
[13:15:54] Stupid question: if I want a finger-in-the-air estimate of the size of our production estate, is looking at puppetboard reasonable? 2396 nodes, of which 88.1% (i.e. 2110) are physical, not virtual?
[13:16:11] [that %age comes from the is_virtual fact]
[13:17:52] yes
[13:18:02] I have used cumin * to estimate that in the past, which points to the same 2396 hosts, 286 VMs
[13:19:30] sudo cumin "* and not F:virtual = physical"
[13:20:52] vgutierrez@cumin1002:~$ sudo -i cumin 'F:is_virtual = true'
[13:20:52] 286 hosts will be targeted
[13:21:55] back to the kafka thingie..
[13:22:12] I'm inclined to switch eqsin consumers to kafka-main@eqiad
[13:22:20] to start discarding stuff
[13:23:56] +1
[13:27:56] that's https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167860 (PCC running ATM)
[13:29:18] PCC is happy, fabfur?
[13:29:25] looking
[13:29:44] +1
[13:30:03] sukhe: A:vms ;)
[13:30:15] merge locked by sukhe
[13:30:16] volans: haha thanks
[13:30:16] hii, could I maybe ask about a mediawiki deployer for the current window? The usual volunteers do not seem to be around šŸ˜”
[13:30:23] volans: I knew I was missing something
[13:30:41] I have to grep it all the time, missing virtual and physical :D
[13:31:00] vgutierrez: done
[13:32:09] <_joe_> MichaelG_WMF: I would ask in #wikimedia-releng
[13:32:19] _joe_: I did.
[13:32:43] but I can try slack
[13:35:34] change applied on cp5017
[13:36:14] <_joe_> MichaelG_WMF: yeah your change lacks a CR, you can't expect SREs to deploy it tbh
[13:36:24] <_joe_> ah I see hashar is around :)
[13:36:36] CR?
[13:37:32] MichaelG_WMF: code review
[13:37:48] I looked at https://wikitech.wikimedia.org/wiki/SRE/Production_access and saw "To minimize risk to the sites, only a small number of people outside of the SRE team hold any production access, and that access is limited to specific systems and processes." and that led me here.
[13:38:49] yes, it is a backport to fix a regression introduced in a change in -wmf.9
[13:39:00] <_joe_> RhinosF1: that doesn't mean a backport is automatically ok. Also thanks, we can handle this topic without your intervention.
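
A note on the librdkafka debug attempt above: `log_level=7` alone mostly raises the syslog level; the `debug` property (a comma-separated list of subsystems such as broker, protocol, fetch, msg) is what really makes the client verbose. Both are standard librdkafka properties, but whether `purged-kafka.conf` passes them through unchanged is an assumption. They can also be tried out-of-band with kafkacat, e.g.:

    # one-off consumer with librdkafka debug output enabled, written to a file
    kafkacat -C -o end -b kafka-main2006.codfw.wmnet:9092 \
        -t eqiad.resource-purge \
        -X debug=broker,protocol,fetch -X log_level=7 2>/tmp/rdkafka-debug.log
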
[13:39:36] fabfur: same issue.. kafka-main@eqiad on cp5017 shows the same kind of RTT
[13:40:04] https://grafana.wikimedia.org/goto/pmODzjsHR?orgId=1
[13:40:37] yeah, was looking at the same panel
[13:40:57] I propose to roll back if this is not the culprit, just to keep the configuration consistent
[13:43:12] this is pretty weird
[13:43:31] because a side effect of switching to main-eqiad was getting a brand new consumer group for cp5017
[13:44:08] <_joe_> yes
[13:44:20] <_joe_> unless
[13:44:23] fabfur: the revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1167864
[13:44:30] <_joe_> doesn't purged save offsets locally?
[13:44:36] _joe_: nope
[13:44:52] <_joe_> I mean if librdkafka isn't doing it
[13:44:56] nope
[13:45:24] I had the opposite issue in the past when we had to move purged@eqsin to kafka-main@eqiad for some maintenance on kafka-main@codfw
[13:45:46] we had to migrate the offsets from eqiad to codfw to avoid purged going through days of purges
[13:45:47] :)
[13:45:49] +1
[13:47:12] revert merged, re-enabling puppet on A:cp-eqsin
[13:47:55] so if this isn't somehow network/latency I don't know what's messing with purged@eqsin
[13:57:27] !log restarting varnish and ATS in cp5017
[13:57:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:03] !log depooling eqsin
[14:41:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:41:42] oncaller: we're going to depool eqsin to investigate the current purged issue
[14:42:29] (and the service is degraded in eqsin)
[14:45:32] while checking https://grafana.wikimedia.org/d/be841652-cc1d-47d3-827f-065768a111dc/purged?orgId=1&from=now-12h&to=now&timezone=utc&var-site=eqsin&var-cluster=$__all I noticed that at the time of the lag increase we also increased the PURGEs to varnish/ats
[14:45:58] anything special that could trigger multiple purges for the same event?
[14:47:01] elukey: I don't think we increased PURGEs
[14:47:21] basically it runs at a steady rate because there is an infinite amount of PURGEs pending
[14:47:24] "infinite"
[14:48:09] vgutierrez: sure but either purged doesn't process them, or it reprocesses the same ones, or there are more than before and it takes time
[14:48:37] I mean I am wondering if the slowdown is client-related for some reason, not network related
[14:48:40] this is why I am asking
[14:48:53] and it appears from RTT etc.. that the problem is towards some brokers
[14:50:16] yep
[14:50:17] we could probably also try to restart kafka on kafka2010, it doesn't hurt, it may help if it triggers a reshuffle of what consumer group leaders see
[14:50:24] wdyt? I can try now if you are ok
[14:50:44] elukey: the problem persisted after switching cp5017 to kafka-main@eqiad BTW
[14:50:58] so I dunno if this is related at all to kafka-codfw
[14:51:36] right ok I see https://grafana.wikimedia.org/d/RvscY1CZk/purged-instance-drilldown?orgId=1&from=2025-07-10T12:58:53.515Z&to=2025-07-10T14:31:28.985Z&timezone=utc&var-site=eqsin&var-cluster=cache_text&var-instance=cp5017
[14:53:36] yep
[16:48:57] purged mystery continues... and it's definitely site-related
[16:49:52] `kafkacat -o beginning -b kafka-main2006.codfw.wmnet:9092 -G cp5017-kc4 eqiad.mediawiki.job.refreshUserImpactJob |pv -l > /dev/null` triggered 500kbps of data transmitted from kafka-main2007 to cp5017; the same thing running on cp4037 triggered 150Mbps of data from kafka-main2007 to cp4037
[16:50:12] (different consumer group of course)
[16:50:46] just shared the two pcaps with topranks; he is looking into it too
[16:50:55] thx sukhe & topranks
[17:03:43] nothing specifically jumping out at me there.... the pcap on 5017 shows a single TCP flow captured mid-way through, some retransmissions but nothing major I would say (esp. given distance)
[17:04:21] probably need to do an iperf test between these hosts to try to work it out, or maybe from another host
[17:06:46] ok thanks for checking
[17:11:19] we depooled eqsin, didn't we?
[17:11:26] yeah we did.
[17:11:40] 14:41 UTC
[17:13:42] the issue seems to have started around 07:58 UTC https://grafana.wikimedia.org/goto/D3whdjsHg?orgId=1
[17:15:04] we have done multiple iterations of ruling out obvious and non-obvious causes but suggestions welcome no matter how silly they are. at this stage, we will try anything :)
[17:17:06] maybe I gotta eat humble pie on this one
[17:17:20] :?
[17:17:25] forcing the traffic via ulsfo -> eqsin seems to result in much better throughput
[17:17:36] šŸ˜…
[17:17:40] as in
[17:17:44] https://www.irccloud.com/pastebin/suClMKBx/
[17:17:47] and now
[17:18:07] https://www.irccloud.com/pastebin/YZGeZKpf/
[17:18:54] 550kbps is what we are seeing in the kafka pcap on the data sent from kafka-main2007 to cp5017 BTW
[17:19:04] vs 150Mbps on cp4037
[17:19:15] what doesn't make sense is why we don't see packet loss when we check though, and why this massive hit to backhaul throughput wouldn't mean the overall throughput on the link decreased
[17:19:33] vgutierrez: can you run that test again?
[17:20:00] from cp5017?
[17:20:02] I will do it, it's late for him
[17:20:07] oh
[17:20:08] 13:20:01 <+jinxer-wm> RESOLVED: [32x] PurgedHighEventLag: High event process lag with purged on cp5017:2112 -
[17:20:11] well
[17:20:17] that's as obvious a sign as any
[17:20:40] https://grafana.wikimedia.org/goto/2uWCpjyNg?orgId=1
[17:20:51] that certainly fixed it
[17:21:04] LOL
[17:21:07] that fixed the issue
[17:21:25] :_)
[17:23:09] site is still depooled right?
[17:23:14] I can flip it back?
[17:23:28] yes, it's depooled
[17:23:29] yeah
[17:26:02] alert triggered again
[17:26:16] and RTT spiked again
[17:32:25] (resolved again fwiw)
[17:41:37] ok yeah I properly drained the Arelion cct to eqsin from codfw
[17:41:57] I can see marginal packet loss over it if I send a udp flood
[17:41:59] https://phabricator.wikimedia.org/P78888
[17:42:18] checking all the interfaces in the path I don't see any smoking gun though, no errors etc.
[17:42:33] latency over it is good still
[17:43:27] what I really don't understand is that the throughput from codfw -> eqsin stayed normal at 8am this morning; if the performance of the circuit radically changed around then, surely it should affect the bw in aggregate?
[17:43:54] which is why I was looking at all the internal DC links but I'm not seeing a problem anywhere
[17:44:09] topranks: I don't have good answers but can this be related? I don't see how but I am also not the best person to comment
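
For reference, the iperf test mentioned above (the two pastebins at 17:17) would look roughly like this; whether iperf3 is installed on these hosts and whether the port is open between them are assumptions, and kafka-main2007 is used purely as an illustration of "a codfw host near the brokers".

    # server side, on a codfw test host (port choice is arbitrary)
    iperf3 -s -p 5201
    # client side on the eqsin cache host; -R reverses the direction so this measures
    # codfw -> eqsin throughput, the direction the kafka fetch responses travel
    iperf3 -c kafka-main2007.codfw.wmnet -p 5201 -R -t 30
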
[17:44:12] 13:14:51 <+jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport:
[17:44:25] 13:14:51 <+jinxer-wm> FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) -
[17:44:46] that is eqiad to codfw but brett pointed this out in the alerts so I am sharing
[17:44:57] it's been down since yesterday, but none of this traffic was ever going via eqiad, I looked at that this morning to double check
[17:45:03] ok thanks
[17:45:26] > then surely it should affect the bw in aggregate?
[17:45:26] and yep could potentially be affected, but the Lumen CCT from codfw -> eqiad looks ok, plus it was like 3pm UTC yesterday that it failed so the time doesn't correlate with this
[17:46:28] fwiw we don't see any noticeable fallout of this anywhere else, just on purged.
[17:47:11] like on ATS talking to the backends
[17:57:01] sukhe: sry was looking at a few bits there
[17:57:31] I was wondering that, what is the majority of traffic on the backhaul from codfw -> eqsin?
[17:57:57] https traffic?
[17:58:02] from where?
[17:58:17] mw and swift vips
[17:58:25] the kafka flows are long-lived TCP sessions right? that's probably different from the MW requests?
[17:58:26] main difference is that all of that is ipv4 btw
[17:58:44] yeah this one is-was v6
[17:59:01] ats reuses the connections btw
[17:59:19] don't think the protocol is a factor, poor iperf results were with v4
[17:59:23] hmm ok
[17:59:24] topranks: basically for anything not in cache and talking to MW (text) and Swift (upload). but we can also probably confirm on the interface throughput dashboard?
[18:01:57] yeah that's what I thought... just trying to figure out what might be affecting kafka differently
[18:02:09] nothing really is making sense to me tbh
[18:02:49] the 0.7% loss is bad and will cause problems, but the whole lot doesn't add up
[18:05:31] traffic team really sorry anyway, I checked a bunch of things earlier and found no smoking gun so gave the network a clean bill of health
[18:05:32] topranks: I guess another question is what changed at 07:58 for this to be a problem anyway
[18:05:35] clearly that was an error
[18:05:50] I know you've all been at it all day as a result :(
[18:05:52] na that's OK, we also ruled out the network independently, so not just you :)
[18:06:25] essentially for the same reasons as above, that everything else was looking OK. so we focussed on what was not OK and that was simply purged.
[18:06:43] and well, by extension of that, kafka. that's how we decided to focus on the kafka host then
[18:08:14] fwiw there is nothing on the kafka host itself, at least, to indicate any kind of saturation. and well, if that were the case, then we would see issues from other places on other topics
[18:09:11] yeah I looked at the stats for it earlier too, plus the other nodes and network bits, nothing showing a fault
[18:12:54] I guess the question is what should we do in the meantime
[18:13:41] The active path from codfw -> eqsin now (via ulsfo) is not causing issues
[18:13:46] so we may as well repool the site
[18:13:59] ok.
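
One hedged way to square "only 0.7% loss" with a single kafka consumer flow collapsing while the aggregate edge traffic looks fine: the classic Mathis et al. approximation for the steady-state throughput of a single long-lived TCP flow. Taking MSS ~ 1448 bytes (a typical value for a 1500-byte MTU path, assumed here), the ~200ms RTT and the ~0.7% loss mentioned above:

    throughput <~ (MSS / RTT) * 1.22 / sqrt(p)
               ~= (1448 bytes * 8 / 0.2 s) * 1.22 / sqrt(0.007)
               ~= 0.85 Mbps

which is the same ballpark as the 550-670 kbps seen in the pcaps, while the many short, parallel client flows toward the MW/Swift VIPs spread the loss out and barely move the aggregate link graphs.
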
[18:14:21] I don't have nearly enough evidence to, say, raise a fault with Arelion on the (currently drained) primary link
[18:14:24] things look good on our end fwiw
[18:14:31] (purged's end)
[18:14:46] and I unfortunately need to step away now for a while (something planned for a long time)
[18:14:50] no worries
[18:14:57] so all I need from your end is that it's safe to repool the site
[18:15:00] good to go on that?
[18:15:03] I'd stay if I could affect anything but can't. I'll be back in an hour or two and will keep looking
[18:15:09] +1 for the repool
[18:15:13] thanks. worst case I will depool
[18:15:16] see you later!
[18:15:23] the break/fix is extremely clear in terms of the WAN path change fixing it
[18:15:38] we flipped and you guys saw it immediately, I could see the diff in iperf rate too
[18:15:42] yep
[18:15:48] it was pretty instant
[18:15:53] so has to be it :)
[18:16:00] the what is clear
[18:16:11] the why is still confusing me
[18:16:27] "the how" more like
[18:16:27] yeah let's go over it tomorrow. run!
[18:50:23] toprank.s: for when you come back, I was thinking, how worried should we be about the Lumen link to codfw now since it's not only serving ulsfo but also eqsin?
[18:51:56] I guess it should be OK given it's off-peak in eqsin right now, and by the time traffic picks up, ulsfo will have levelled off and things will balance?
[18:52:02] I also know we have done this in the past but yeah
[19:24:11] (things are looking good so far)
[19:55:07] I’m not that worried but it is certainly harder to reason about and more risky should we have an issue or attack in California
[19:56:40] should be fine
[20:24:02] fwiw this is the link, currently at about 50% usage handling traffic for both POPs so it should be more than able to handle things
[20:24:03] https://grafana.wikimedia.org/goto/TnfFUCyNg?orgId=1
[20:25:57] also worth noting that if we depool eqsin those users go to ulsfo and this link has to deal with all their traffic anyway (hence the jump in usage is at 14:46 UTC when eqsin was depooled, not later when the transport path to eqsin was changed)
[20:40:07] yep fair point.