[09:24:34] 10Puppet, 10Infrastructure-Foundations, 10observability, 10cloud-services-team (Kanban): 2 systemctl services failing on cloudcontrol hosts: prometheus-openstack-exporter and logrotate - https://phabricator.wikimedia.org/T303511 (10aborrero)
[10:38:37] 10netops, 10DC-Ops, 10Infrastructure-Foundations, 10SRE, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ayounsi) @Jclark-ctr let's hold on putting public hosts in the new rows for now. So ideally those would go to A-D.
[11:15:32] volans, XioNoX: perhaps one of you can help me
[11:15:49] (sry volan.s you are off, leave it)
[11:16:48] I notice if I run homer as verbose it always shows these RPC errors for "statement has no contents", but they seem to be ignored?
[11:16:52] https://www.irccloud.com/pastebin/HTmo3PzI/
[11:17:38] Which makes sense and is fine.
[11:18:09] I'm toying with a template to configure the port speeds within "chassis fpc", but every time this is applied the router returns a few warning messages.
[11:18:33] https://www.irccloud.com/pastebin/YNmtnzHn/
[11:19:09] ^^^ like this one. I'm wondering did we do anything to tell homer to ignore the first type of error? or if you'd any ideas on how we might tackle this?
[11:19:55] This second warning is currently causing the homer job to fail
[11:20:51] topranks: https://doc.wikimedia.org/homer/master/configuration.html#config-yaml see the ignore_warning bit
[11:21:19] ah!
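For reference, the matching that `ignore_warning` ultimately does (via the Junos PyEZ commit call linked later in this conversation) works on substrings or regexes. A minimal sketch of why a plain substring suffices for a warning whose start and end vary; the exact warning wording below is hypothetical, reconstructed from the discussion:

```python
import re

# Hypothetical warning text: the real message embeds the port and speed,
# which change between runs, so an exact-string match would be brittle.
warning = "Warning: config will be applied to ports xe-0/0/0 at 10g"

# A plain substring on the stable middle of the message is enough; no
# anchors or wildcards are needed to cover the variable start and end.
assert re.search("config will be applied to ports", warning)

# A full regex also works if the variable parts ever need pinning down:
assert re.search(r"config will be applied to ports \S+ at \S+", warning)
```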
should have checked the docs, sry
[11:21:32] appreciate it volan.s thanks, can't ask better than that for an answer :)
[11:21:38] from https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/homer/templates/config.yaml.erb
[11:21:46] I mean in prod it's defined in ^^^
[11:22:31] now how safe or not it is to ignore those and your new one, I'll defer to you and ar.zhel :)
[11:23:14] haha ok yep no problem on that :)
[11:24:29] as for how that's used in junos see the link in
[11:24:29] https://doc.wikimedia.org/homer/master/api/homer.transports.junos.html#homer.transports.junos.ConnectedDevice.commit
[11:25:00] ah ok
[11:25:06] that's actually very useful
[11:25:15] it can be a full regex
[11:25:46] according to junos pyez docs
[11:26:38] nice! you are one step ahead of me :)
[11:26:53] yeah as my error has the port & speed in it - which will change - it would be good to do more of a pattern match
[11:26:55] I will dig into it
[11:26:57] thanks!
[11:27:29] or just 'config will be applied to ports'
[11:27:53] unless they match the full string, I don't know
[11:28:31] I'm testing it now, will let you know
[11:29:46] Yeah it worked fine with "config will be applied to ports" added to the list of messages to ignore
[11:30:13] I didn't need to bother with any regex / wildcards to cover the start and end of the message
[14:15:06] is netbox down?
[14:15:35] worksforme
[14:16:19] paravoid@daisy:~$ openssl s_client -connect netbox.wikimedia.org:443
[14:16:22] CONNECTED(00000003)
[14:16:25] GET / HTTP/1.1
[14:16:27] Host: netbox.wikimedia.org
[14:16:30] and times out
[14:16:31] no response
[14:17:58] https://www.irccloud.com/pastebin/pAWUSFo2/
[14:18:04] _joe_ was also reporting some oddball problems
[14:18:19] interesting
[14:18:24] <_joe_> so
[14:18:27] with netbox or in general?
[14:18:36] with other things in eqiad
[14:18:38] the effects can be mtu related
[14:18:40] <_joe_> paravoid: to the puppet compiler
[14:18:56] <_joe_> I have one more datapoint
[14:19:04] <_joe_> from my debian desktop, I have issues
[14:19:13] <_joe_> from my laptop running ubuntu, zero.
[14:20:32] <_joe_> and for added fun
[14:20:44] <_joe_> for me, netbox works
[14:20:57] <_joe_> but openssl s_client -connect puppet-compiler.wmflabs.org:443 behaves exactly like for paravoid
[14:21:08] I've ssh'ed into the box
[14:21:13] can one of you get a tcpdump from the far end
[14:21:15] tcpdump shows it's sending packets out, but I never receive them
[14:21:20] large packets?
[14:21:24] <_joe_> same here
[14:21:31] 1440, but large ICMP works for me
[14:21:54] <_joe_> yes, ping worked and even large payloads work
[14:21:54] networks are the worst
[14:22:08] XioNoX, topranks ^
[14:22:09] <_joe_> and mtr shows my connection with no packet loss
[14:22:21] _joe_: tcp mtr?
[14:22:45] <_joe_> uhm don't remember but I think so, lemme check what I ran
[14:22:58] <_joe_> cdanis: yes
[14:23:17] <_joe_> so earlier I had a similar issue
[14:23:31] <_joe_> when trying to connect to mirror.wikimedia.org *from a docker container* it would fail
[14:23:36] <_joe_> from the host, it would work
[14:24:03] <_joe_> I'm not sure what's going on but it surely defies my networking knowledge
[14:24:04] paravoid: didn't ping get routed to the ping hosts?
[14:24:18] s/ping/icmp/
[14:24:31] to netbox1001? I wouldn't think so
[14:25:14] yeah so
[14:25:16] ack, i thought it was all icmp from the edge but it's not something i'm too familiar with so suspect you know more than me :)
[14:25:18] pmtu is broken
[14:25:35] topranks: perhaps the EVPN patches?
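The kind of manual probing that follows (pings of varying sizes, with and without DF) can be automated as a binary search for the largest size that gets through. A sketch with a stand-in probe function; a real probe would shell out to something like `ping -M do -s <size> <host>` and check the exit status:

```python
def pmtu_search(probe, lo=576, hi=1500):
    """Binary-search the largest size for which probe(size) succeeds."""
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always terminates
        if probe(mid):
            lo = mid   # mid got through: the answer is mid or larger
        else:
            hi = mid - 1  # mid was dropped: the answer is below mid
    return lo

# Simulated path where anything over 1492 bytes fails, as on a PPPoE link:
assert pmtu_search(lambda size: size <= 1492) == 1492
```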
[14:27:25] yeah large pings to my IP fail
[14:27:36] but not if I send them with don't fragment
[14:27:49] so I wonder if it's that small fragments are dropped rather than large
[14:28:50] traceroute to my IP goes through eqx2-ash.new.seabone.net, ae8.francoforte73.fra.seabone.net, and then my ISP
[14:32:43] ok I think this is it
[14:32:59] smells like a runt packets situation or something like that
[14:33:11] paravoid@daisy:~$ ping -s 1464 netbox1001.wikimedia.org
[14:33:11] PING netbox1001.wikimedia.org (208.80.154.12) 1464(1492) bytes of data.
[14:33:14] 1472 bytes from netbox1001.wikimedia.org (208.80.154.12): icmp_seq=1 ttl=56 time=158 ms
[14:33:17] paravoid@daisy:~$ ping -M do -s 1465 netbox1001.wikimedia.org
[14:33:20] PING netbox1001.wikimedia.org (208.80.154.12) 1465(1493) bytes of data.
[14:33:23] ping: local error: message too long, mtu=1492
[14:33:25] paravoid@daisy:~$ ping -s 1465 netbox1001.wikimedia.org
[14:33:28] PING netbox1001.wikimedia.org (208.80.154.12) 1465(1493) bytes of data.
[14:33:31] ^C
[14:33:40] (same from another box with mtu 1500)
[14:34:17] Sorry I'm afk right now getting my bloods done.
[14:34:27] Will look when I'm back if it's ongoing
[14:34:50] paravoid: yes confirm i get the same
[14:35:42] well `ping -4s 1465 185.143.92.17 ` works but with DO i get the same error
[14:35:52] no that's normal
[14:36:02] the "message too long" with -M do is normal
[14:36:37] absence of a response, either a time=NNN ms or a "message too long", is a problem
[14:39:10] fragments are being dropped
[14:39:25] doesn't matter the size, no it wasn't a runt packet thing
[14:39:28] has anything changed recently?
[14:40:54] Don't believe there were any changes on CRs that would affect them, no
[14:41:08] hi
[14:41:11] hi :)
[14:41:14] reading scrollback
[14:41:15] was about to call :)
[14:41:34] tl;dr there is some kind of packet drop issue, that I think I've pinpointed to fragments being dropped
[14:42:04] are paths going through CF?
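The payload limits in the pings above follow directly from fixed header overhead: an ICMP echo of `-s N` bytes occupies N + 8 (ICMP header) + 20 (IPv4 header) bytes on the wire, hence 1464 on a 1492-byte PPPoE path and 1472 on plain 1500-byte Ethernet. As a quick check:

```python
IP_HDR = 20    # IPv4 header, no options
ICMP_HDR = 8   # ICMP echo header

def max_ping_payload(mtu):
    """Largest `ping -s` size that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HDR - ICMP_HDR

assert max_ping_payload(1492) == 1464  # PPPoE DSL path, as in the log above
assert max_ping_payload(1500) == 1472  # plain Ethernet path
```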
[14:42:16] no
[14:42:26] and v4 only?
[14:42:35] so far yes
[14:42:54] paravoid: `ping -M dont -4s 1472 185.143.92.17 ` works for me so perhaps something further out than our network?
[14:43:21] jbond: try ping -s 2000 netbox1001.wikimedia.org
[14:44:27] so these packets never arrive on netbox1001; but what is surprising is that the original issue I had was in the response path
[14:44:36] I wonder if these are two separate issues
[14:44:58] <_joe_> XioNoX: same symptoms as me
[14:45:06] <_joe_> only I have them with another host
[14:45:10] 'ping -s 2000' doesn't work towards any bastions of ours, for what it's worth, it could be orthogonal
[14:46:29] can you share an mtr with -s 2000 ?
[14:47:02] paravoid: from my testing from netbox: anything up to 1464 with DF set gives DF needed; anything up to 1472 (which + 8 icmp header + 20 ip header == 1500) works; anything above that fails
[14:48:14] is it the same toward a router, let's say cr1-eqiad.wikimedia.org ?
[14:49:00] ok I think we are looking at two separate issues
[14:49:59] XioNoX: i get an mtr error if trying to use s > 1500
[14:49:59] Unexpected mtr-packet error
[14:50:05] one is e.g. 'ping -s 2000' towards all sites/hosts (yes including routers)
[14:50:18] anything that results in the packet being fragmented
[14:50:33] by tcpdumping on the netbox host, these packets never arrive there at all
[14:50:50] and I've tried it from multiple vantage points
[14:50:51] iirc we discard frag icmp
[14:50:56] ok, that would explain it
[14:51:02] so that was a red herring in that case
[14:51:02] checking
[14:51:22] the other issue, that only happens from one of my two DSLs but not the other,
[14:51:58] is, at the high level, that curl https://netbox.wikimedia.org/ times out
[14:52:25] by tcpdumping on both ends, I see the request landing on netbox1001, netbox1001 responding, but the response never being received on my end
[14:52:37] yeah, see https://www.irccloud.com/pastebin/1BTx53rm/
[14:52:52] ack
[14:53:01] sorry for that detour
[14:53:14] better to have 1 issue than 2 :)
[14:53:18] haha
[14:54:14] 14:53:40.490775 IP 208.80.154.12.443 > XXX.53986: Flags [.], seq 1:1441, ack 518, win 84, options [nop,nop,TS val 120464442 ecr 4083288290], length 1440
[14:54:17] 14:53:40.938739 IP 208.80.154.12.443 > XXX.53986: Flags [.], seq 1:1441, ack 518, win 84, options [nop,nop,TS val 120464890 ecr 4083288290], length 1440
[14:54:20] 14:53:41.802787 IP 208.80.154.12.443 > XXX.53986: Flags [.], seq 1:1441, ack 518, win 84, options [nop,nop,TS val 120465754 ecr 4083288290], length 1440
[14:54:23] 14:53:43.530771 IP 208.80.154.12.443 > XXX.53986: Flags [.], seq 1:1441, ack 518, win 84, options [nop,nop,TS val 120467482 ecr 4083288290], length 1440
[14:54:26] 14:53:47.082836 IP 208.80.154.12.443 > XXX.53986: Flags [.], seq 1:1441, ack 518, win 84, options [nop,nop,TS val 120471034 ecr 4083288290], length 1440
[14:54:29] these never arrive
[14:54:48] both my and _joe_'s IPs are over seabone
[14:55:03] <_joe_> yes
[14:55:07] let's downpref that path and see if it changes anything?
[14:55:12] paravoid: is that still happening? if so I can try to disable the session and see
[14:55:19] it is
[14:55:33] <_joe_> still happening for me too
[14:55:48] alright, give me 2 min to downpref it and 2 so it propagates
[14:55:56] can someone check NEL in the meantime?
[14:56:01] ack
[14:56:37] we don't emit NEL headers for any sites that don't go through Varnishes, btw
[14:57:58] cdanis: yeah but it should have a larger impact than just netbox, anything with a big reply payload
[14:58:11] yeah
[14:58:29] I'm looking but can't find a smoking gun
[14:59:15] it could also be on the ashburn-europe seabone paths, so we wouldn't have enough traffic there
[14:59:25] yeah that's kind of what I'm thinking
[14:59:42] er, I added that avoid-path in esams...
[14:59:46] haha
[15:00:10] I was thinking "it sure takes long to propagate"
[15:01:29] alright, applied to cr2-eqiad
[15:02:00] and sure enough
[15:02:02] works
[15:02:05] and cr1, as we learn them from the RS as well
[15:02:18] _joe_: ^^
[15:02:23] wfm now
[15:02:36] now I forgot what I wanted to do with netbox...
[15:02:41] <_joe_> yep here too
[15:02:44] <_joe_> ahah ofc
[15:03:20] <_joe_> XioNoX: not sure it's the case
[15:03:25] paravoid: will you reach out to your contact?
[15:03:28] <_joe_> (that large packets are involved)
[15:03:47] XioNoX: it's friday 5pm, so better luck with the noc I think
[15:03:49] <_joe_> but seabone definitely was
[15:04:14] yeah I'm not sure what it was either
[15:04:24] I didn't have any issues with my ssh session for example
[15:04:32] to bast1003
[15:04:38] ls -lR / worked too
[15:04:41] weird spooky ecmp stuff perhaps
[15:05:02] hmm, how to phrase that so their noc understands...
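The capture pasted earlier is also self-diagnosing: all five segments carry the same `seq 1:1441`, and the gaps between them roughly double, which is TCP retransmitting the same 1440-byte segment with exponential RTO backoff because the ACKs never make it back. Checking the timestamps:

```python
# Timestamps (seconds) of the repeated seq 1:1441 segments from the capture.
ts = [40.490775, 40.938739, 41.802787, 43.530771, 47.082836]
gaps = [b - a for a, b in zip(ts, ts[1:])]

# Each retransmission interval roughly doubles: ~0.45s, 0.86s, 1.73s, 3.55s,
# the classic signature of a dropped return path rather than congestion loss.
for earlier, later in zip(gaps, gaps[1:]):
    assert 1.5 < later / earlier < 2.5
```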
[15:05:18] yeah no idea
[15:05:38] worst case we leave the peering down for a week, reenable and see if it got fixed :)
[15:05:53] I'll try to write a quick patch to turn on NEL for Netbox
[15:05:57] maybe with 1.0 sampling fraction :P
[15:06:16] yeah, it's downpref, so still live for inbound traffic
[15:06:24] on an unrelated note
[15:06:36] An exception occurred: Exception: Cannot connect to PuppetDB https://puppetdb-api.discovery.wmnet:8090///v1/facts/is_virtual - 502 Bad Gateway
[15:06:40] says netbox
[15:06:46] Exception: Cannot connect to PuppetDB https://puppetdb-api.discovery.wmnet:8090///v1/facts/is_virtual - 502
[15:07:50] paravoid: looking
[15:09:07] I guess there are not a lot of users in greece and italy that connect to eqiad
[15:09:23] no, but seabone is big in south america
[15:09:29] XioNoX: and for cases where that's true, we're missing NEL data
[15:10:00] cdanis: time to add BGP data to NEL :)
[15:10:09] eheheh
[15:10:12] so depending on where this issue is within seabone, it could affect the projects
[15:11:16] yeah, NEL should give a decent picture of how impactful it is
[15:11:34] paravoid: where do you see that error? there was an issue with the uwsgi service but i have restarted it and all (curl and manually running the netbox reports) looks good
[15:12:21] jbond: seems ok now - when you login, on the right hand side there are these reports; two were saying "Errored" before and if you clicked on them it had what I pasted above
[15:12:34] paravoid: smaller MTUs were working fine, right?
[15:12:43] ahh ok cool, yes, should be fixed now
[15:12:58] XioNoX: not really
[15:13:05] I couldn't reproduce with ping
[15:13:14] curl https did not work
[15:13:24] ok
[15:13:35] the initial tcp handshake worked
[15:13:51] and TLS handshake, IIRC
[15:14:10] the body failed, which is why I suspected MTU issues
[15:14:11] paravoid: it's hard to say that anything is wrong on https://logstash.wikimedia.org/goto/a0bc0ff360b12ebb348815f3a5a9db56
[15:14:17] paravoid: isn't that a sign of smaller packet sizes working fine?
[15:14:22] yes
[15:14:29] ok cool
[15:14:36] I even turned back on the 'noisier' things that might show this issue like h2.ping_failed, 'abandoned', and 'unknown'
[15:19:53] email sent, noc@ cced
[15:27:30] XioNoX: sneak preview https://puppet-compiler.wmflabs.org/pcc-worker1001/1/cp1075.eqiad.wmnet/fulldiff.html
[15:27:48] it's been a long time coming but assuming the review gods are on my side i hope to have this hit prod next week
[15:28:06] sweet!
[15:28:27] jbond: :O
[15:28:29] once in, it should be much much easier to add additional data
[15:28:38] like networks etc
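On the NEL tangent above: turning on Network Error Logging for a site boils down to emitting two response headers, `Report-To` and `NEL`. A hypothetical sketch with the 1.0 failure sampling fraction joked about earlier; the group name and endpoint URL here are made up, not production values:

```python
import json

# Hypothetical NEL setup; real deployments usually sample failures well
# below 1.0 to limit report volume.
report_to = {
    "group": "nel-demo",
    "max_age": 604800,
    "endpoints": [{"url": "https://nel-intake.example.org/report"}],
}
nel = {"report_to": "nel-demo", "max_age": 604800, "failure_fraction": 1.0}

headers = {
    "Report-To": json.dumps(report_to),
    "NEL": json.dumps(nel),
}
assert json.loads(headers["NEL"])["failure_fraction"] == 1.0
```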
[15:14:22] yes [15:14:29] ok cool [15:14:36] I even turned back on the 'noisier' things that might show this issue like h2.ping_failed, 'abandoned', and 'unknown' [15:19:53] email sent, noc@ cced [15:27:30] XioNoX: sneek preview https://puppet-compiler.wmflabs.org/pcc-worker1001/1/cp1075.eqiad.wmnet/fulldiff.html [15:27:48] its been a long time comong but assuming the review gods are an my side i hope to have this hit prod next week [15:28:06] sweet! [15:28:27] jbond: :O [15:28:29] once in it should be much much easier to add addtional data [15:28:38] liek networks etc [16:43:07] nice! [16:43:36] nitpick would be to be clear on what "row" means, netbox uses "rack group" [16:44:19] and the mapping may not be 1:1 forever [16:44:51] (there was a whole conversation around the layout of the eqiad expansion -- equinix had some limitations originally on how it would deliver the racks we requested) [17:14:24] paravoid: thanks i think for now i have just been concentrating on getting something from netbox -> puppet. once this is in place it should be very simple to bike shed on what we want and how to present it