[08:20:11] topranks, XioNoX I want to write an icinga check that tests that MSS clamping is happening, and ideally one that isn't super complex
[08:20:36] scapy works fairly well
[08:20:38] https://www.irccloud.com/pastebin/luttqKHb/
[08:21:01] and you get [('MSS', 1436)] as the output
[08:21:32] nice!
[08:22:07] what I don't like about sr1() (send receive one packet) is that it's actually sniffing the response with libpcap
[08:23:36] a raw socket should be enough to get the synack response to the initial syn
[08:28:39] vgutierrez: not an expert but maybe this is what you want: https://stackoverflow.com/questions/57010477/how-to-change-tcp-header-and-options-using-pythons-socket-library#comment109292318_57010643 ?
[08:29:50] so that would set TCP_MAXSEG to an arbitrary value
[08:30:22] I was looking at the "getsockopt" below, but it's only the outbound value, not the inbound
[08:30:40] what I need is to check that the MSS option sent by the server is less than or equal to the configured TCP MSS clamping value
[08:30:53] yeah... userland doesn't have visibility over the 3way handshake
[08:31:08] one way would be using eBPF
[08:31:18] but I don't wanna use eBPF for an icinga check
[08:33:56] looks like ebpf solves all the problems :)
[08:34:30] yeah... a raw socket should be enough in this case
[08:36:23] scapy provides a L3RawSocket class... let's see if I can leverage that rather than sniffing stuff :_)
[08:38:36] please consider dropping a prometheus textfile for node-exporter instead of an icinga check
[08:39:08] godog: yeah, no problem with that
[08:39:39] cheers vgutierrez !
[08:45:36] hmm scapy isn't that bad if scapyconf.sniff_promisc = 0 is used
[08:45:56] cool :)
[08:53:33] XioNoX: the other thing I don't like is.. sport=RandShort() on the TCP layer
[08:53:50] that's basically picking the source port using a random number generator
[08:53:59] so it could match an already existing connection
[08:54:20] I guess I could bind a socket, get the assigned port, use it and close the socket after finishing the test
[08:54:26] * vgutierrez testing
[08:57:57] nope... kernel stack gets that synack :/
[08:59:01] I'm replying to your comment about which MSS to use, and noticed that Cloudflare looked at which MSS we use in https://blog.cloudflare.com/increasing-ipv6-mtu/ :) almost going full circle now
[09:02:24] vgutierrez: eh "The team had underestimated the complexity of changing the MTU across all our racks of equipment." - https://blog.cloudflare.com/high-availability-load-balancers-with-maglev/
[09:08:07] XioNoX: https://github.com/cloudflare/ipvs is pretty cool
[09:08:39] are you going to use it?
[09:08:48] instead of katran?
[09:09:40] thing is.. we could implement both in LiBerica control plane and switch at will between the two..
[09:09:40] I thought it was an abstraction layer to manage it
[09:10:08] https://github.com/cloudflare/ipvs is a go package that lets you manage IPVS
[09:10:38] ah right, so it replaces ipvsadm
[09:11:00] yep, both should work
[09:11:11] the cloudflare package talks to the kernel via netlink
[09:16:27] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (ayounsi) That's a great question. I don't think we have the resources to do an extensive investigation. I see 2 options: # either we only subtract the tunnel header from the default MSS...
[09:19:29] godog: do you have a puppetization example handy of node_exporter file drop + systemd timer? :)
[09:20:10] hmm prometheus::node_trafficserver_config should do it
[09:45:09] vgutierrez: yeah that, also check out prometheus::node_textfile which might be more convenient, modules/profile/manifests/firewall.pp has an example
[09:45:28] yeah.. I was checking node_ssh_open_sessions
[15:08:56] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez) thx @ayounsi we will go with option 1: * IPv4: 1500 - 20 (IP) - 20 (IP) - 20 (TCP) = 1440 bytes * IPv6: 1500 - 40 (IPv6) - 40 (IPv6) - 20 (TCP) = 1400 bytes
[16:13:42] (SystemdUnitFailed) firing: anycast-healthchecker.service Failed on dns4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:13:52] ^ this is fine
[16:15:26] Traffic, Data-Persistence, Infrastructure-Foundations, SRE-tools, and 3 others: Switch conftool to use the version 3 etcd datastore - https://phabricator.wikimedia.org/T350565 (jbond)
[16:18:42] (SystemdUnitFailed) firing: (2) anycast-healthchecker.service Failed on dns4003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:25:32] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (Jclark-ctr)
[16:27:09] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (Jclark-ctr)
[16:27:36] netops, Infrastructure-Foundations, SRE, ops-eqiad: Move asw2-c8-eqiad to spares - https://phabricator.wikimedia.org/T349798 (Jclark-ctr) Stalled→Resolved
[16:38:42] (SystemdUnitFailed) resolved: tcp-mss-clamper.service Failed on ncredir4002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:07:48] Traffic, Data-Engineering, Observability-Logging: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Milimetric) Besides the great discussion above, I just want to point out some related things. * Varnish captures timestamps in a specific way as part of its loggi...
[17:57:43] (SystemdUnitFailed) firing: ipip-multiqueue-optimizer.service Failed on lvs4010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:12:42] (SystemdUnitFailed) resolved: ipip-multiqueue-optimizer.service Failed on lvs4010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:28:15] Traffic, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur) Looping in @CDanis as the original author for the [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/789219 | cp1075 hiera overrides ]...
[18:42:24] Traffic: Consolidate hieradata for new eqiad cp hosts - https://phabricator.wikimedia.org/T352078 (Fabfur)
[20:29:28] Traffic, SRE, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[20:34:07] Traffic, Wikidata, wmde-wikidata-tech, Wikimedia-production-error: 503 on Wikidata - https://phabricator.wikimedia.org/T352094 (AlexisJazz)
[21:07:14] Traffic, Wikidata, wmde-wikidata-tech, Wikimedia-production-error: 503 on Wikidata - https://phabricator.wikimedia.org/T352094 (ssingh) Does this still persist for you? We had a [[ https://grafana.wikimedia.org/d/pr6ZUm5nz/haproxy-cluster-view?orgId=1&var-site=esams&var-cluster=text | blip ]] tha...
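
Editor's note on the MSS check discussed at 08:30:40 and the arithmetic posted at 15:08:56: the comparison itself can be sketched without scapy once the raw TCP header of the SYN-ACK is available. The following is a minimal sketch and not the check that was actually deployed. The function names (expected_clamped_mss, parse_mss, mss_is_clamped, build_synack_header) are hypothetical, and capturing the SYN-ACK (raw socket, scapy, or otherwise) is deliberately left out.

```python
import struct

MTU = 1500                  # Ethernet MTU used in the 15:08:56 message
TCP_HEADER = 20             # TCP header without options
IP_HEADER = {4: 20, 6: 40}  # IPv4 / IPv6 header sizes in bytes


def expected_clamped_mss(ip_version, mtu=MTU):
    """MTU minus outer IP header, inner IP header and TCP header ("option 1")."""
    ip = IP_HEADER[ip_version]
    return mtu - ip - ip - TCP_HEADER


def parse_mss(tcp_header):
    """Return the MSS option (kind 2) from a raw TCP header, or None if absent."""
    data_offset = (tcp_header[12] >> 4) * 4   # header length in bytes
    options = tcp_header[20:data_offset]
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                          # end of option list
            break
        if kind == 1:                          # NOP is a single byte
            i += 1
            continue
        if i + 1 >= len(options):
            break                              # truncated option
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                              # malformed option
        if kind == 2 and length == 4:
            return struct.unpack("!H", options[i + 2:i + 4])[0]
        i += length
    return None


def mss_is_clamped(tcp_header, ip_version):
    """True if the server advertised an MSS at or below the expected clamp."""
    mss = parse_mss(tcp_header)
    return mss is not None and mss <= expected_clamped_mss(ip_version)


def build_synack_header(mss):
    """Hypothetical helper: a synthetic SYN-ACK header carrying only an MSS option."""
    base = struct.pack("!HHIIBBHHH", 443, 40000, 0, 1, 6 << 4, 0x12, 65535, 0, 0)
    return base + bytes([2, 4]) + struct.pack("!H", mss)


if __name__ == "__main__":
    header = build_synack_header(1436)
    print(parse_mss(header), mss_is_clamped(header, 4), mss_is_clamped(header, 6))
```

With the synthetic header above, an advertised MSS of 1436 (the value seen at 08:21:01) is within the IPv4 limit of 1440 but above the IPv6 limit of 1400. A result like this could be written to a node-exporter textfile as suggested at 08:38:36.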