[08:29:33] dcaro: sorry for the late response. It's not necessarily expected that they'd increase, but it is not impossible.
[08:30:27] The sustained short-term arrival rate of packets from anything connected on the ASWs is now, in theory, going to be higher in eqiad.
[08:32:29] So that could have a knock-on effect on the cloudsw, with more traffic arriving in, and thus needing to be sent out, resulting in full buffers/tail-drops there.
[08:35:15] The change made on the ASWs needs to be replicated on all devices across the estate. We could certainly look at the schedule for that and tackle the cloudsw's first to see if it will help alleviate the issue.
[08:37:44] FWIW I think the scale is off in the graphs on that Grafana dashboard.
[08:38:26] I'm not 100% on the transformations being applied to the data from Graphite, but looking at the equivalent graphs in LibreNMS the per-second discard rate (in pps) is much lower than the numbers showing on those graphs.
[08:38:26] https://librenms.wikimedia.org/graphs/to=1627979700/id=20029/type=port_errors/from=1627893300/
[08:38:36] Nevertheless there are drops, so there is an issue either way.
[08:52:26] thanks for checking, looking at the graphs, I think that the scale is ok. In the RRD plot on LibreNMS, when it says 'Packets/sec' it actually means 'kPackets/sec', and it has trimmed the second-to-last peak (on the monthly graph), which should hit ~450kpps, while the last one is equal in both graphs at ~200kpps. The total values shown in the legend on LibreNMS don't match the totals on Grafana
[08:52:28] as LibreNMS is showing the 'delta' for the period, while Grafana shows the 'total count'; to see the same you have to go to the 'two years' period in LibreNMS (~840M)
[08:53:33] topranks: anyhow, yep, we should tackle that at some point, no rush (as it's nothing that wasn't breaking before), but it will be much appreciated. Just let me know whenever you get to it so we can sync, thanks!
[08:56:05] jbond: is there anything blocking https://gerrit.wikimedia.org/r/c/operations/puppet/+/692286 other than the site/datacenter discussion?
[08:56:51] if not let's merge please, it seems like an excellent solution for T282880
[08:56:53] T282880: Revisit varnish dynamic backends mechanism - https://phabricator.wikimedia.org/T282880
[08:58:57] +1
[09:00:36] also there's no discussion, 'site' is better *g*
[09:06:08] ema: FYI I think John is off for the next few weeks
[09:08:46] topranks: oh thank you!
[09:10:42] dcaro: ok, thanks for the feedback. I'll need to dig deeper into the numbers. 200,000 packets is an awful lot of data to be dropping in one second. It represents a very serious problem if that's the case.
[09:11:44] e.g. we have dropped >1M packets since 5am this morning (looking at the total counter)
[09:12:22] for xe-0_0_2 on cloudsw1-c8-eqiad
[09:13:51] topranks: happy to help if you need/want it
[09:15:54] That's not great, but at 200kpps you'd hit a million in a matter of seconds, so definitely less worrying than I had feared. But I'll do some due diligence to make sure I understand the numbers in LibreNMS and nothing is amiss.
[09:16:43] Either way all we can do is adjust the buffer configuration, add additional ports, or replace equipment. So we should probably plan to make the config adjustment and see where we're at after that.
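To illustrate the units question discussed above (per-second rate vs. "delta for the period" vs. cumulative total on the same counter), here is a minimal Python sketch; it is not from the original conversation, and the counter values and 300-second poll interval are invented for the example.

```python
# Minimal sketch (illustrative only): how a single cumulative SNMP discard
# counter yields a per-second rate, a per-period delta, and a total count.
# The sample values and the 300 s poll interval are invented assumptions.

samples = [
    # (unix_timestamp, cumulative ifOutDiscards counter value)
    (1627974000, 1_000_000),
    (1627974300, 1_060_000),   # +60k discards over 5 minutes -> 200 pps
    (1627974600, 1_060_500),
    (1627974900, 1_120_500),
]

def per_second_rates(samples):
    """Per-second discard rate for each polling interval (what a pps graph plots)."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((c1 - c0) / (t1 - t0))
    return rates

period_delta = samples[-1][1] - samples[0][1]   # a "delta for the period" legend
total_count = samples[-1][1]                    # a raw "total count" panel

print("pps per interval:", per_second_rates(samples))
print("delta over period:", period_delta)
print("cumulative total:", total_count)
```

The same counter therefore produces very different-looking numbers depending on whether a graph divides the delta by the interval, sums deltas over the visible window, or plots the raw cumulative value, which is consistent with the LibreNMS/Grafana discrepancy described above.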
[09:43:49] topranks: I see the issue, you are right, the values are not really per second, but per time slice
[09:48:32] interestingly enough, the shape is exactly the same (for a period of 1 month), but the values are scaled by 1k, I'm a bit suspicious about that xd
[09:50:37] Yeah, I think it may just be that the scales are different, the shape looks fine. I'm running a script doing a manual snmpget against one of those interfaces as a test, I'll leave it running a little while. Comparing that raw data to what LibreNMS shows should reveal what units it's showing or if any of the stats are off.
[09:50:54] Fair to say we've a not insignificant number of drops however, which isn't good :(
[13:57:52] apergos: o/ yt?
[14:25:03] ema: do you know of anything that would have caused any issues across the upload caches ~13 hours ago?
[14:25:28] we got an alert of webrequest data loss on all upload hosts for the hour of 08-03T01:00
[14:31:09] ottomata: hi! Nothing I'm aware of
[14:32:07] if all hosts are affected at the same time it seems unlikely to be a varnishkafka issue on the nodes themselves though
[14:33:45] all hosts across the 5 DCs?
[14:33:56] yes
[14:34:13] https://gist.github.com/ottomata/e59118b3f242dcc03882df5b5cd1dc12
[14:34:15] we're investigating
[14:37:07] ok, this is def an issue on our side
[14:37:16] the raw data has all the expected requests
[14:37:19] something in our pipeline then
[14:37:20] thanks!
[19:50:46] Can someone with puppet permissions abandon this patchset? https://gerrit.wikimedia.org/r/c/operations/puppet/+/598292
[19:53:31] done
[22:56:02] thx
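The manual snmpget check mentioned at 09:50:37 could look roughly like the sketch below. This is not the script that was actually run; the device name, SNMP community string, and ifIndex are placeholders, and it assumes net-snmp's snmpget is installed. It simply reads the IF-MIB ifOutDiscards counter twice and prints the implied per-second discard rate, which can then be compared against what LibreNMS graphs for the same interface.

```python
# Rough sketch of a manual discard-counter check (placeholder values throughout).
import re
import subprocess
import time

DEVICE = "cloudsw1-c8-eqiad.example"   # placeholder hostname, not the real FQDN
COMMUNITY = "public"                   # placeholder SNMP community string
IF_INDEX = "531"                       # placeholder ifIndex for the interface
OID = f"IF-MIB::ifOutDiscards.{IF_INDEX}"

def read_counter():
    """Return the current cumulative discard counter value via snmpget."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, DEVICE, OID], text=True
    )
    # Typical output: "IF-MIB::ifOutDiscards.531 = Counter32: 12345"
    return int(re.search(r"Counter\d+:\s+(\d+)", out).group(1))

INTERVAL = 30  # seconds between the two polls
first = read_counter()
time.sleep(INTERVAL)
second = read_counter()
print(f"~{(second - first) / INTERVAL:.1f} discards/sec over the last {INTERVAL}s")
```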