[09:10:03] XioNoX, topranks: so the homer check config email often fails for mr1-eqsin with a timeout, is that something you've already looked into? can we extend the timeout maybe?
[09:10:25] volans: didn't look, no
[09:10:39] timeout where?
[09:10:55] ncclient.transport.errors.SessionError: Capability exchange timed out
[09:11:22] that then gets converted into a jnpr.junos.exception.ConnectError exception
[09:11:34] I didn't check whether we can pass a timeout param all the way down to make it longer
[09:11:47] assuming it's just slowness and not something else
[09:11:52] not sure it would help though
[09:12:11] afaik the default timeouts are high enough
[09:12:38] do you think it's the host not responding, or packet loss, or something along those lines?
[09:13:57] most likely the former, or some kind of race condition leading to a lock
[09:14:38] we have so little packet loss that a tcp retransmit would be enough
[09:15:06] Yeah I’d say it’s more likely the device alright.
[09:17:58] IIRC it's almost always the eqsin one
[09:18:22] so wondering if we're just unlucky given it's one of the newer ones, or whether the distance might be a factor
[09:28:34] distance may well be a factor yeah. But probably something device-specific also - I've not seen this for the CRs out there.
[09:31:01] CPU regularly hits max on it, I'm wondering if that's cos of SSH attempts. It's likely why it's timing out.
[09:34:12] ah, is there something like fail2ban we could set up?
[09:36:05] something like that would be perfect yeah.
[09:36:08] But I don't believe there is anything like that you can run on JunOS.
[09:36:32] Looking at the number of failures, there is not a strong correlation with the times the cpu hits max though.
[09:37:08] In general there are lots of blocked connections, and the device is logging them all. I wonder if that's causing the high CPU.
[09:38:01] ack, let me know if I can help in any way on the homer side, although so far it seems like something that should be fixed on the device itself, if possible
[09:40:14] yeah. only thing I can think of on the homer side would maybe be to reduce timeouts and increase retries... but I think it's best we try to address it on the device.
[09:40:46] ack
[10:05:12] topranks, volans: https://phabricator.wikimedia.org/T278289
[10:06:27] ack
[16:16:24] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10fgiunchedi)
[16:27:06] * volans having a look ^ as john is out
[17:00:43] 10Puppet, 10Infrastructure-Foundations: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10Volans) So, after a quick check this is what I found: * this is happening on `ms-be105[1-9]`, `ms-be205[1-6]` and `relforge100[3-4]` * the issue was introduced...
[17:13:41] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10Volans) p:05Triage→03Medium
[17:36:10] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10Volans) All the affected hosts are HP and seem to have an `Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)` network car...
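[editor's note] Re: the homer/NETCONF timeout thread above (09:10–09:40): the "reduce timeouts and increase retries" idea could look roughly like the sketch below. This is not homer's actual code; jnpr.junos.Device and ConnectError are real PyEZ names, but the retry wrapper, its parameters, and the conn_open_timeout argument shown in the usage comment are assumptions to be checked against the junos-eznc version in use.

```python
# Hypothetical sketch, not homer's implementation: retry a Junos NETCONF
# session open a few times before giving up.  ConnectError is what the
# ncclient SessionError ("Capability exchange timed out") surfaces as.
import time

from jnpr.junos import Device
from jnpr.junos.exception import ConnectError


def open_with_retries(hostname, attempts=3, delay=10, **kwargs):
    """Try to open a NETCONF session, retrying on ConnectError."""
    last_exc = None
    for attempt in range(1, attempts + 1):
        dev = Device(host=hostname, **kwargs)
        try:
            dev.open()
            return dev
        except ConnectError as exc:
            last_exc = exc
            time.sleep(delay * attempt)  # simple linear backoff between tries
    raise last_exc


# Illustrative usage; whether your junos-eznc accepts conn_open_timeout (and
# whether it covers the capability exchange) needs verifying:
# dev = open_with_retries("mr1-eqsin.example", conn_open_timeout=60)
```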
[17:53:31] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be1051 - https://phabricator.wikimedia.org/T290984 (10cmooney) This may be related to this reported bug. It seems these Intel cards have an on-board LLDP agent, which if enabled causes it to p...
[17:57:06] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10Volans)
[18:16:55] 10Puppet, 10Infrastructure-Foundations, 10Patch-For-Review: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4] - https://phabricator.wikimedia.org/T290984 (10cmooney) We attempted disabling the NIC's own LLDP parser by echoing the command and it seems to ha...
[20:15:13] ... oh my gosh.
[20:15:25] I cleared the historical data for two of the statuspage.io metrics
[20:15:46] now statograph is failing just on alert2001 because thanos-query there is taking too long (>20s) to return the older data
[20:15:51] on just one of those metrics
[20:15:53] and
[20:16:00] we won't ever get an icinga alert about this
[20:16:11] because the only thing that checks the status of systemd services running on alert2001 is alert2001
[20:16:19] because of how we have our 'dual-homed' icinga set up
[20:16:40] now none of this is a big deal or anything, but what a toe-stubbing
[20:17:34] also, it seemingly just started working fine on alert2001, possibly because the process also running on alert1001 pushed it past the interval where accessing historical data from thanos was slow
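[editor's note] Re: the X710 on-board LLDP agent (17:53–18:16 above): a common host-side mitigation for the i40e driver is to turn off the firmware LLDP agent so the OS lldp daemon sees the switch instead. The sketch below is illustrative only and is not the change referenced in T290984; the interface name is an example and the availability of the "disable-fw-lldp" private flag depends on driver/kernel version. Older drivers used `echo lldp stop > /sys/kernel/debug/i40e/<pci-address>/command`, which is presumably the "echoing the command" mentioned in the ticket.

```python
# Hypothetical sketch: disable the Intel X710's firmware LLDP agent via the
# i40e "disable-fw-lldp" ethtool private flag, so lldp_neighbors can resolve.
import subprocess


def disable_fw_lldp(interface: str) -> None:
    """Turn off the NIC's on-board LLDP agent for the given interface."""
    subprocess.run(
        ["ethtool", "--set-priv-flags", interface, "disable-fw-lldp", "on"],
        check=True,
    )


if __name__ == "__main__":
    disable_fw_lldp("eth0")  # example interface name only
```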