[09:46:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:51:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:40:27] topranks: o/ https://phabricator.wikimedia.org/T420223#11752904 - veeery weird
[10:40:48] does it ring a bell?
[10:52:05] elukey: just afk right now will check in a short while
[10:52:27] np! Even tomorrow, nothing urgent
[11:30:02] topranks: I've read the reply, makes sense, thanks. What I am wondering though is the following - those worst latency timings align with the frequency of the timeouts we are seeing between eqiad and codfw for mcrouter, so not necessarily a sign that the network is performing badly in the general use case, but for super latency-sensitive apps like mcrouter even outliers could count.
[11:30:45] I am also seeing weird values to mc1041 from wikikube-worker1070
[11:31:47] is there a way to test if those bumps in latency are "real" or artificial? Like the router taking time to generate TTL exceeded etc.
[11:38:11] it's unrealistic to expect that we'll never have some jumps in RTT on the network (say buffers get full due to some burst)
[11:38:45] I think we need to engineer the apps to perform well even if we occasionally have higher RTT, rather than try to fix it at the network layer
[11:39:00] there are fixes - look at the high-frequency trading world - but I'm not sure that's the way to approach this
[11:40:29] elukey: it's hard to assess exactly what is causing the higher RTT.
In theory if we have pcaps on either side we can look at tx vs rx times for packets, and work out the one-way delay.
[11:40:48] but what might be hard is getting that down to ms accuracy, given there will be some drift between the clocks on the systems on either side anyway
[11:41:55] it absolutely could be due to buffering on the network when we have bursts though, so let's not assume it's only cosmetic due to ICMP generation
[11:46:44] I added some other mtr reports and it seems something is also happening within eqiad
[11:47:12] topranks: you are totally right about the RTT expectations, but I am puzzled why this happens only on a subset of nodes
[11:47:32] do you think it may depend on the other hosts in the rack and what they do?
[11:49:42] elukey: the problem is finance will never sign off on the tens of millions you want to fix the jitter
[11:50:13] one thing to bear in mind is that anything in eqiad row a/b is going to have packet drops and high jitter
[11:50:36] this is due to those rows being connected at 10G to older (Trident 2) switches with low buffer memory
[11:50:49] I'm not sure if that correlates with the hosts that have worse jitter
[11:51:09] the opposite :D hosts with weird jitter are in D/C
[11:51:12] the other thing I'd wonder is are these connected at 10g/1g or a mix?
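(Editor's note: a minimal sketch of the one-way-delay idea mentioned above - match the same packet in a sender-side and a receiver-side capture and subtract the timestamps. The packet keys and timestamps here are invented; with real pcaps you would need a stable key such as IP ID plus TCP sequence number, and the result still includes any clock offset between the two hosts, as noted in the log.)

```python
def one_way_delays(tx, rx):
    """tx, rx: dicts mapping a packet key -> capture timestamp in seconds,
    taken on the sending and receiving host respectively.
    Returns {key: rx_time - tx_time} for packets seen on both sides;
    packets missing from rx (e.g. dropped in transit) are skipped."""
    return {k: rx[k] - tx[k] for k in tx if k in rx}

# Illustrative data, not from real captures:
tx = {"pkt1": 100.0000, "pkt2": 100.0010, "pkt3": 100.0020}
rx = {"pkt1": 100.0004, "pkt2": 100.0019}  # pkt3 never arrived
delays = one_way_delays(tx, rx)
print({k: round(v * 1000, 2) for k, v in delays.items()})
# -> {'pkt1': 0.4, 'pkt2': 0.9}  (milliseconds)
```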
[11:51:19] well then let's just blame Nokia
[11:51:22] :P
[11:52:16] what we probably need is to get two sretest hosts in those same racks
[11:52:22] ah wow interesting, some nodes are 1G afaics, I never tried ifstat to see what happens to them
[11:52:37] and try to do some more production-grade RTT tests somehow, I'll have to look into how we might do that
[11:53:12] shouldn't matter too much 1g/10g, but if a burst of packets arrives at 10g they are cleared 10 times faster, thus less time queuing for the last of those packets, thus less jitter
[11:53:16] both good and bad workers have 1g afaics
[11:53:24] 800ms is something insane - that *has* to be delay in one of the hosts
[11:53:29] the network simply won't buffer anything that long
[11:53:52] the 200ms values in the other one could be the network
[11:53:54] Effie had a good idea to cordon some of the nodes that show high jitter from k8s, to see how the errors go
[11:54:11] yeah that's definitely worth doing +1
[12:03:40] elukey: I'm gonna take a packet capture on wikikube-worker1070 for traffic to that host to see if I can spot anything
[12:08:20] elukey: which hosts are 1g?