[11:08:04] dhinus: I'm not in front of the laptop, but it just occurred to me that it may be interesting to check if keepalived has flapped in cloudgw since the failover yesterday
[11:08:42] arturo-afk: I kept an eye on it for 1 hour or so, and it didn't flap. let me check now
[11:09:23] no flaps!
[13:30:27] dhinus: it's too soon to declare victory, but https://phabricator.wikimedia.org/T374830#10415523
[17:07:08] the dns issue isn't fixed :(
[17:19:27] my money is still on packet loss somewhere. deciding where that happens is of course the deep magic that none of us has figured out yet.
[17:35:25] I'm on the verge of telling the CI folks 'life is uncertain, use git-retry', but I assume that then whatever's happening will just find something new to break
[17:45:15] test suites that fail 15 minutes in because a DNS lookup for an NPM dependency failed are pretty demoralizing for "just try again" to be the solution.
[17:46:19] toolforge job failure emails becoming spam because of network failures is pretty rough too
[17:46:35] yeah, we can't add retries everywhere
[17:48:12] I totally understand the frustration of not being able to find a smoking gun though.
[17:50:03] I think that topranks believes/wants to treat this as a dns-specific issue, but you're seeing evidence of more general network failures, right? A variety of symptoms, not just dns lookups?
[17:50:57] as far as I can see we can't really replicate any dns or network issues, right?
[17:51:16] the only way we can produce timeouts is if we flood the dns server with requests from a whole bunch of endpoints at once?
[17:51:32] I'm not sure in that case we are hitting the same issue that is being observed with the build failures
[17:51:41] topranks: that's not strictly correct, I can reproduce the dns issue without flooding. It's just highly intermittent, often hours between failures.
[17:51:47] ok
[17:51:54] my gitlab-account-approval tool shows signs of DNS failures, but also of other network interruptions during HTTPS requests to internal services (gitlab, phabricator) at various times.
[17:51:55] well, my 100k checks were error free
[17:52:12] if we have a system where 1 failure in 100k dns queries makes it break, we should make the application more tolerant
[17:52:26] if we're seeing more failures than that, we need to be able to reproduce them
[17:53:09] I think there must be 'weather' that causes it to happen more often during certain intervals.
[17:54:45] I should put some effort into better error reporting for gitlab-account-approval and see if I can use that to give everyone some new information. The workload in that tool is a pretty natural network + DNS stability test.
[17:55:30] yeah, like from my point of view, to try and diagnose the problem we need to be tracing failed dns queries (or other network connections)
[17:56:03] so like pcaps or similar at all the various points the packet might take, plus good application logs where we can see the DNS ID of the failed query (like in a dig or something)
[17:57:11] topranks: do we still not have valid pcaps that include a failure?
[17:57:22] no
[17:57:44] ok, I thought rook had gotten you valid ones last week.
[17:57:52] nah
[17:57:54] I will work on that, although the logs will be enormous
[17:58:03] I only need the log of one :)
[17:58:25] but again, if "the logs will be enormous" because it's 1 out of 1 million failing, then the app needs to be more tolerant
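
The tracing topranks describes just above (pcaps at the points the packet might take, plus application logs that record the DNS ID of the failed query) could be approximated with a small probe along these lines. This is only a sketch: it assumes the dnspython library, and the resolver address and query name are placeholder values, not the actual Cloud VPS ones.

    # Sketch of a DNS probe that records the query ID of every lookup so a
    # failed query can be matched against a parallel packet capture (e.g. a
    # tcpdump of port 53 on the same host). Assumes the dnspython library;
    # the resolver address and query name are placeholders.
    import time
    import dns.exception
    import dns.message
    import dns.query

    RESOLVER = "172.20.255.1"        # placeholder: local recursor address
    QNAME = "gitlab.wikimedia.org."  # placeholder: name to look up

    while True:
        query = dns.message.make_query(QNAME, "A")
        start = time.time()
        try:
            dns.query.udp(query, RESOLVER, timeout=3)
            status = "ok"
        except dns.exception.Timeout:
            status = "timeout"
        except OSError as exc:
            status = f"error:{exc}"
        # The DNS ID logged here is what to grep for in the pcap.
        ts = time.strftime("%Y-%m-%dT%H:%M:%S")
        print(f"{ts} id={query.id} rtt={time.time() - start:.3f}s {status}")
        time.sleep(1)

Running it alongside a packet capture of port 53 on the same host would let a single logged failure be found by its DNS ID in the pcap.
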
[17:58:51] sure, but it's clearly not just one app that's suffering from this
[17:59:02] I suspect what's happening here is something on these local systems which is preventing them from making the network connections, and there is no issue on the network / dns server. If there were, we'd have had failures in the tests I ran
[17:59:03] and in the case of CI, the app is git
[17:59:27] yeah, so git is not going to give up after 1 failed dns query
[17:59:45] I wouldn't think so
[18:00:57] if we can reliably reproduce the issue we can systematically troubleshoot it
[18:01:55] as things stand now I'm at a dead end, as all tests are testing clean, and we've really hammered it at this stage
[18:03:07] yeah
[18:04:19] CI surfaces the problem more reliably because a) we run a lot of CI jobs and b) people pay attention to CI job failures. The instability is everywhere in the Cloud VPS network, in my experience and understanding.
[18:07:08] bd808: if we had instability "everywhere in the network", surely we could reproduce it?
[18:12:31] That is typically what the network engineers say, yes. As you implied before with the comment about making applications more tolerant of failures, we hide a lot of the pain by adding more layers that work around it. I put 6 weeks of work into wikibugs and made it recover better, etc.
[18:17:05] Being pragmatic - there is a missing piece here
[18:18:15] On the face of it, from the results of the tests I ran, the error rate is non-existent, or possibly in the order of 1 out of a million requests or so
[18:18:35] If that rate of error was truly causing major problems, there would be a case for the application to be more tolerant
[18:19:14] I don't think git, or Linux, or whatever the HTTPS client is, are going to completely fall over with such a low incidence of errors though
[18:20:22] They are either seeing more errors than we've found in the synthetic tests we've run, or there is something else causing them to fail to make those connections
[18:20:37] Either way we need to find the pattern of what is failing, from what devices, at what times
[18:20:58] in an effort to reproduce the issue, and then be able to look at it occurring and troubleshoot
[18:24:19] if the message is "there is instability everywhere in the cloud network", then let's try to find a way to demonstrate that and start digging into the problems
[18:24:42] on the face of it I don't know what to check next, cos everything we've been asked to look at so far is testing clean
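
The "better error reporting" bd808 mentions, aimed at surfacing the pattern topranks asks for (what is failing, from what devices, at what times), might look roughly like the sketch below. It is illustrative only: it assumes the Python requests library, and the log path and target URLs are placeholders rather than anything gitlab-account-approval actually does.

    # Sketch of structured failure logging for outbound HTTPS calls: classify
    # each failure (connection/DNS, timeout, HTTP error) and record the URL
    # and time so patterns across hosts and time windows can be spotted later.
    # Assumes the requests library; the log path and targets are placeholders.
    import datetime
    import json
    import requests

    LOG_PATH = "/tmp/network-failures.jsonl"  # placeholder

    def checked_get(url, timeout=10):
        record = {"time": datetime.datetime.utcnow().isoformat(), "url": url}
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.exceptions.ConnectionError as exc:
            # DNS resolution failures also surface here; keep the message for triage.
            record["error"] = f"connection: {exc}"
        except requests.exceptions.Timeout:
            record["error"] = "timeout"
        except requests.exceptions.HTTPError as exc:
            record["error"] = f"http: {exc}"
        with open(LOG_PATH, "a") as fh:
            fh.write(json.dumps(record) + "\n")
        return None

    # Example: probe the internal services mentioned in the discussion.
    for target in ("https://gitlab.wikimedia.org", "https://phabricator.wikimedia.org"):
        checked_get(target)

Each failure then becomes one JSON line with a timestamp, URL, and error class, which is about the minimum needed to start correlating failures across hosts and time windows.
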