[04:22:39] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:23:15] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:23:45] FIRING: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:27:39] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:28:45] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:32:40] RESOLVED: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[04:33:15] RESOLVED: SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[11:38:03] netops, Data-Persistence, DBA, Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9976274 (cmooney) Open→Resolved
[14:04:23] Traffic, SRE, Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9976672 (Joe) A couple of notes: * Overriding `name` to always be the same, and just one object per tag group, makes syncing and querying less efficient an...
[14:43:38] Traffic, SRE, Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9976790 (ssingh) Thanks for the feedback @Joe! >>! In T369366#9976672, @Joe wrote: > A couple of notes: > > * Overriding `name` to always be the same, and...
[14:50:06] Traffic, SRE, Patch-For-Review: Migrate DNS depooling of sites from operations/dns (git) to confctl - https://phabricator.wikimedia.org/T369366#9976817 (ssingh) Final (famous last words) form: ` confctl --object-type geodns select 'geodns=generic-map,name=eqiad' get ` with the key being: ` /conft...
[15:30:15] If I have a service that needs very long (1 day) timeouts in lvs, is there any config in the service catalog needed or is it just left to other parts of the stack? I don't see anything in puppet for it but just want to rule it out
[15:40:47] timeout of a day?
[15:41:07] yeah, it's the shellbox instances for videoscaling
[15:41:12] the LVS timeout is 900 seconds and then various HTTP timeouts are documented at https://wikitech.wikimedia.org/wiki/HTTP_timeouts (this might not be up to date)
[15:42:27] LVS timeout?
[15:42:38] sukhe@lvs1020:~$ sudo ipvsadm -l --timeout
[15:42:38] Timeout (tcp tcpfin udp): 900 120 300
[15:43:08] maybe for tracking purposes?
[15:43:49] but LVS itself doesn't perform TCP handshakes against the realservers
[15:46:10] from https://wikitech.wikimedia.org/wiki/HTTP_timeouts it looks like it's appservers who connect to videoscaler.discovery.wmnet and their timeout is 1 day (86400s)
[15:47:21] ah so lvs shouldn't matter as such. just wanted to confirm, we're trying to track down some rather elusive timeouts against the new service
[15:47:33] hmm
[15:47:39] LVS in low traffic perform balancing at L2
[15:47:54] rewrites the MAC address of the destination and forward the packet
[15:48:03] no TCP timeout involved at all
[15:48:25] vgutierrez: but I assume LVS times out its conn table entries?
[15:49:04] kamila_: but it won't set an RST back to the client AFAIK
[15:49:21] *send
[15:49:31] that makes sense, it'd just send the next packet somewhere random, right?
[15:49:34] the LVS only handles the RX traffic
[15:49:48] right
[15:50:03] kamila_: considering the scheduler, that's possible, yes
[15:50:06] are we seeing RSTs?
[15:50:12] * kamila_ just joined
[15:50:33] not sure, I was basing my understanding on the timeout for conn tracking :)
[15:51:57] * kamila_ wonders what a videoscaler pod does if it gets the middle of someone else's connection
[15:52:17] but yeah, I'm not really sure where in the stack it is, could be completely unrelated to LVS
[15:52:50] all I know is it works on my laptop, so that narrows things down to everything minus what's on my laptop, which is approximately everything :D
[15:55:46] kamila_: I'm assuming that there are keepalives and no radio silence from the client during those 86400s?
[15:56:27] keepalive as in TCP keepalive
[16:02:51] I don't think there are, I proposed adding them
[16:03:00] (but haven't figured out how to do that yet)
[16:04:50] it would be interesting to see a .pcap of the issue you're seeing if any
[16:05:55] historically we got 1 day timeouts for videoscalers forever (at least 6 years and 5 months) without any issues related to LVS and its 900s timeout for TCP sessions
[16:06:09] right
[16:06:33] then it might be something other than LVS, I'm just not sure what
[22:38:29] Traffic, Content-Transform-Team-WIP, iOS-app-feature-Performance, RESTBase, and 6 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365#9978558 (JTannerWMF)
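
Note on the 15:55 to 16:03 exchange about TCP keepalives: the log leaves open how the client would enable them. A minimal sketch in Python follows, assuming a Linux client and a plain TCP socket. The function name `enable_keepalive` and the default idle, interval, and probe values are hypothetical choices for illustration, not anything taken from the videoscaler or shellbox code. The only figure from the log is the 900 s TCP timeout shown by `ipvsadm -l --timeout`, and the sketch simply keeps the idle time below it. Whether keepalives are relevant to the timeouts being chased is not established in the log.

```python
import socket

# TCP timeout reported by `ipvsadm -l --timeout` in the log above.
LVS_TCP_TIMEOUT_S = 900


def enable_keepalive(sock, idle_s=300, interval_s=60, probes=5):
    """Turn on TCP keepalive for `sock` (hypothetical helper).

    idle_s:     seconds of silence before the first probe is sent
    interval_s: seconds between unanswered probes
    probes:     unanswered probes before the kernel drops the connection
    """
    if idle_s >= LVS_TCP_TIMEOUT_S:
        raise ValueError("idle_s should be below the 900s LVS TCP timeout")

    # Portable switch: ask the kernel to send keepalive probes at all.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Fine-grained tuning is platform-specific; these constants exist on
    # Linux, so guard each one to keep the sketch importable elsewhere.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Usage would be to call `enable_keepalive(sock)` on the client socket before or right after connecting. HTTP client libraries usually need their own hook to reach the underlying socket, which is outside what this sketch covers.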