[07:55:25] 10SRE-tools, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 (10MoritzMuehlenhoff) Since the Thanos hosts run Buster and a more recent kernel/glibc/systemd, I disabled the cleanup cron job on these hosts, so...
[09:59:54] effie: hey, let me know if you want to do that test at any stage.
[10:00:21] sure sure, you want row A servers right?
[10:00:38] I notice the TCP re-transmissions weren't as high yesterday as the day before. Wonder if the switchover would have caused increased traffic somehow compounding the problem.
[10:01:53] Doesn't really matter which row, but I guess no need to do them all, I'd been looking at an instance in row A so yeah, let's do it there if that works.
[10:03:33] ok let me find something there
[10:03:41] and give you 2 servers
[10:03:49] cool
[10:10:24] I see retransmissions from other rows btw
[10:14:25] topranks: mw1422 and mw1455
[10:14:28] Yeah this problem exists everywhere. The only place you wouldn't see effects from what I'm describing is TCP sessions between devices on the *same* row (i.e. they don't go via uplinks to CRs).
[10:14:34] both on row A
[10:16:22] Ok great. Let me know when I'm ok to make the change. I'll change the interfaces file (to cover the reboot case), but to change the route I'll do this, which should be fairly quick, though there is a brief window when the server has no default route:
[10:16:44] sudo ip route del default via 10.64.0.1 && sudo ip route add default via 10.64.0.3 dev eno1 onlink
[10:20:08] you can make a change
[10:20:30] ok... I'll do mw1422 first.
[10:20:33] I am not disabling puppet since we concluded yesterday that it is not needed
[10:22:29] correct, should be fine to leave it enabled.
[10:23:32] ping when you are done, so I can repool the servers
[10:23:58] Ok change made on mw1422, no pings dropped from elsewhere in the DC.
[10:25:28] Moving on to mw1455
[10:32:09] IPv4 done, I am just realizing I didn't fully consider v6, looking.
[10:37:08] Ok all done.
[10:38:43] effie: you can go ahead and repool
[10:39:02] anything interesting?
[10:39:48] No, though for v6, given how the router advertisements work, it's kind of different.
[10:40:21] You can't reliably remove the existing default (a subsequent RA packet from the CR will re-insert it), so instead I had to create a second default route with a lower metric.
[10:40:36] In terms of the performance it's too early to say, we'll really need to see it in the graphs.
[10:41:18] And - for context - this will only move the outbound path for those servers; the things they are talking to could have the same issue on the return path. But at the very least we should see a reduction if the switch discards are the cause of the retransmissions.
[10:47:12] the things they are talking to probably have the same issue on the return path, if I understand correctly
[10:47:20] that is the case for the memcached servers
[10:48:01] Yep. But we can't change *everything*, it'd be too risky, and just doing a subset of the "other side" servers wouldn't eliminate it completely.
[10:48:30] it is what it is :D
[10:48:35] We should still get a strong signal if the re-transmissions reduce, even if they don't go away.
[13:41:13] topranks: just checking before it gets out of my very little memory buffer... what do we want to do with the swifts and lldp?
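For reference, the v6 workaround described at 10:40:21 (leaving the RA-learned default in place and adding a second default route with a lower metric so it takes precedence) might look roughly like the sketch below. The gateway link-local address, interface name and metric value are illustrative assumptions, not the values actually used; RA-installed routes default to metric 1024 on Linux, so anything lower wins.

    # Add a preferred default alongside the RA-learned one (illustrative values).
    sudo ip -6 route add default via fe80::2 dev eno1 metric 512

    # Verify both defaults are present and which one is preferred.
    ip -6 route show default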
[15:17:05] 10Mail, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (10MoritzMuehlenhoff) Status update: mx2001 is reimaged to Bullseye and working fine so far. The smart hosts config on our servers has been switched to prefer mx2001 over mx1001 and...
[15:17:23] 10Mail, 10Infrastructure-Foundations, 10SRE, 10Patch-For-Review: Upgrade MXes to Bullseye - https://phabricator.wikimedia.org/T286911 (10MoritzMuehlenhoff)
[16:35:01] volans: sorry I'd not got back to it; I'm thinking maybe we just push it out to devices for now, until we get time to review it properly.
[16:35:58] In terms of the fix I'd begun to warm to just configuring it to happen "if local LLDP neighbours empty"
[16:36:13] It'd be rare for the switch LLDP to break etc., and we'd get an alert regardless.
[16:36:33] But I got sidetracked trying to work out how to implement such logic.
[16:36:39] So maybe for now it's better to just push it out.
[16:37:36] ack, feel free to ping me or I can volunteer mo.r.it.z for that :-P in puppet you can just check the value of the fact fwiw
[16:45:12] what's up with LLDP?
[16:59:27] XioNoX: https://phabricator.wikimedia.org/T290984 has the background
[16:59:47] tldr some of the NICs have an onboard LLDP agent
[16:59:51] which was misbehaving
[17:00:12] I understand TCP offload but not LLDP offload 😅
[17:05:59] lol
[17:06:26] cdanis: do we wanna do it now on one swift host and follow up later on the others?
[17:06:41] I'd rather do it now or Monday, both because I'll not be around tomorrow and it's Friday :D
[17:09:51] let's do it now
[17:14:56] cdanis: LLDP is a foundational block for FCoE, and NICs sometimes have features where they offload FCoE in hardware and present themselves as SCSI controllers etc.
[17:15:11] ah! that does make some sense
[17:29:57] cdanis: ack, doing ms-be2051 manually then
[17:31:25] that's done and !log-ed
[17:31:42] I'll add the cumin command to run later on the others to the task for review
[17:38:39] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10Volans) All hosts have the same identifiers: ` $ sudo cumin 'ms-be105[1-9]*,ms-be205[2-6]*' 'ls -1 /sys/kernel/debug/i4...
[17:38:52] cdanis: ^^^ to get another pair of eyes
[17:39:15] Thanks volans.
[17:39:59] Looks good to my untrained eyes anyway.
[17:41:01] Do you think it's a good idea to push this out today? Or should we wait till Monday? Or risk tomorrow?
[17:42:06] volans: looks good
[17:43:19] how long do we want to wait?
[17:44:04] i know I'm probably being over-cautious here
[17:48:35] if lldp still looks good from the switch side I think just proceed
[17:49:02] let me check that, it was fine yesterday on the relforge hosts.
[17:49:59] topranks: it's asw-a-codfw
[17:50:17] check for any sign of dropped connections too, but I didn't see anything strange
[17:50:23] yeah it looks good, last LLDP input was 21 secs ago.
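The exact command applied manually on ms-be2051 isn't reproduced in the log (the task comment above is truncated, and it references /sys/kernel/debug/i40e, so a debugfs-based approach may have been used). As a hedged illustration only, a common way to disable the on-NIC firmware LLDP agent on Intel X710/i40e cards uses an ethtool private flag; the interface name below is an assumption.

    # Stop the NIC firmware from consuming LLDP frames so they reach the kernel
    # (interface name eno1 is an assumption; not necessarily the fix used here).
    sudo ethtool --set-priv-flags eno1 disable-fw-lldp on

    # Confirm the flag took effect.
    sudo ethtool --show-priv-flags eno1 | grep disable-fw-lldp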
[17:50:40] Last port flap was 8 weeks back
[17:51:05] ack
[17:51:14] * volans ready to proceed if you agree
[17:51:42] for you topranks: -b is --batch and -s is --sleep, the sleep before scheduling another host
[17:51:46] Yeah I'm happy enough
[17:51:47] as cumin doesn't do fixed batches
[17:51:51] but uses a sliding window
[17:52:01] man it's so awesome
[17:52:06] so as soon as one host finishes it will schedule the next one
[17:52:25] I've made so many hacky scripts down through the years, it's like the Lamborghini of that shit :)
[17:52:32] ahahahah
[17:53:42] without further ado... proceeding
[17:53:59] fire away
[18:01:40] ahaha
[18:09:14] all done, all seems good so far
[18:09:46] 🎉
[18:12:41] nice
[18:12:50] dupe of https://phabricator.wikimedia.org/T250367 I guess?
[18:16:21] those are the only ones that were showing the problem though
[18:30:00] Looks similar Xio.Nox, but this seemed specific to the Intel X710, or at least the affected hosts all had an Intel NIC and the fix is Intel-specific.
[18:31:47] And actually in this case it was the other way around.
[18:32:11] The switches could see the server details fine (i.e. Linux LLDP messages were being transmitted)
[18:32:32] It was just that the kernel wasn't getting the LLDP frames from the switch, as the NIC was eating them.
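To illustrate the -b/-s sliding-window behaviour discussed at 17:51:42, a hypothetical cumin invocation could look like the one below. The host query is taken from the task comment earlier in the log, but the batch size, sleep and the remote command itself are assumptions for illustration, not the command actually run.

    # Hypothetical sketch: run the fix one host at a time (-b 1), sleeping 30s
    # before scheduling each additional host (-s 30). Cumin uses a sliding
    # window, so a new host is scheduled as soon as a previous one finishes.
    sudo cumin -b 1 -s 30 'ms-be105[1-9]*,ms-be205[2-6]*' \
        'ethtool --set-priv-flags eno1 disable-fw-lldp on'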