[11:08:04] dhinus: I'm not in front of the laptop, but it just occurred to me that it may be interesting to check if keepalived has flapped in cloudgw since the failover yesterday
[11:08:42] arturo-afk: I kept an eye on it for 1 hour or so, and it didn't flap. let me check now
[11:09:23] no flaps!
[13:30:27] dhinus: it's too soon to declare victory, but https://phabricator.wikimedia.org/T374830#10415523
[17:07:08] the dns issue isn't fixed :(
[17:19:27] my money is still on packet loss somewhere. deciding where that happens is of course the deep magic that none of us has figured out yet.
[17:35:25] I'm on the verge of telling the CI folks 'life is uncertain, use git-retry', but I assume that then whatever's happening will just find something new to break
[17:45:15] test suites that fail 15 minutes in because a DNS lookup for an NPM dependency failed are pretty demoralizing for "just try again" to be the solution.
[17:46:19] toolforge job failure emails becoming spam because of network failures is pretty rough too
[17:46:35] yeah, we can't add retries everywhere
[17:48:12] I totally understand the frustration of not being able to find a smoking gun though.
[17:50:03] I think that topranks believes/wants to treat this as a dns-specific issue, but you're seeing evidence of more general network failures, right? A variety of symptoms, not just dns lookups?
[17:50:57] as far as I can see we can't really replicate any dns or network issues, right?
[17:51:16] the only way we can produce timeouts is if we flood the dns server with requests from a whole bunch of endpoints at once?
[17:51:32] I'm not sure in that case we are hitting the same issue that is being observed with the build failures
[17:51:41] topranks: that's not strictly correct, I can reproduce the dns issue without flooding. It's just highly intermittent, often hours between failures.
[17:51:47] ok
[17:51:54] my gitlab-account-approval tool shows signs of DNS failures, but also of other network interruptions during HTTPS requests to internal services (gitlab, phabricator) at various times.
[17:51:55] well, my 100k checks were error free
[17:52:12] if we have a system where 1 failure in 100k dns queries makes it break, we should make the application more tolerant
[17:52:26] if we're seeing more failures than that, we need to be able to reproduce them
[17:53:09] I think there must be 'weather' that causes it to happen more often during certain intervals.
[17:54:45] I should put some effort into better error reporting for gitlab-account-approval and see if I can use that to give everyone some new information. The workload in that tool is a pretty natural network + DNS stability test.
[17:55:30] yeah, like from my point of view, to try and diagnose the problem we need to be tracing failed dns queries (or other network connections)
[17:56:03] so like pcaps or similar at all the various points the packet might take, plus good application logs where we can see the DNS ID of the failed query (like in a dig or something)
[17:57:11] topranks: do we still not have valid pcaps that include a failure?
[17:57:22] no
[17:57:44] ok, I thought rook had gotten you valid ones last week.
[17:57:52] nah
[17:57:54] I will work on that, although the logs will be enormous
[17:58:03] I only need the log of one :)
[17:58:25] but again, if "the logs will be enormous" because it's 1 out of 1 million failing, then the app needs to be more tolerant
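
The tracing topranks describes just above (pcaps at the points the packet might take, plus application logs that record the DNS ID of the failed query) could be approximated with a small probe along these lines. This is only a sketch: it assumes the dnspython library, and the resolver address and query name are placeholder values, not the actual Cloud VPS ones.

    # Sketch of a DNS probe that records the query ID of every lookup so a
    # failed query can be matched against a parallel packet capture (e.g. a
    # tcpdump of port 53 on the same host). Assumes the dnspython library;
    # the resolver address and query name are placeholders.
    import time
    import dns.exception
    import dns.message
    import dns.query

    RESOLVER = "172.20.255.1"        # placeholder: local recursor address
    QNAME = "gitlab.wikimedia.org."  # placeholder: name to look up

    while True:
        query = dns.message.make_query(QNAME, "A")
        start = time.time()
        try:
            dns.query.udp(query, RESOLVER, timeout=3)
            status = "ok"
        except dns.exception.Timeout:
            status = "timeout"
        except OSError as exc:
            status = f"error:{exc}"
        # The DNS ID logged here is what to grep for in the pcap.
        ts = time.strftime("%Y-%m-%dT%H:%M:%S")
        print(f"{ts} id={query.id} rtt={time.time() - start:.3f}s {status}")
        time.sleep(1)

Running it alongside a packet capture of port 53 on the same host would let a single logged failure be found by its DNS ID in the pcap.
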
[17:58:51] sure, but it's clearly not just one app that's suffering from this
[17:59:02] I suspect what's happening here is something on these local systems which is preventing them from making the network connections, and there is no issue on the network / dns server. If there were, we'd have had failures in the tests I ran
[17:59:03] and in the case of CI, the app is git
[17:59:27] yeah, so git is not going to give up after 1 failed dns query
[17:59:45] I wouldn't think so
[18:00:57] if we can reliably reproduce the issue we can systematically troubleshoot it
[18:01:55] as things stand now I'm at a dead end, as all tests are testing clean, and we've really hammered it at this stage
[18:03:07] yeah
[18:04:19] CI surfaces the problem more reliably because a) we run a lot of CI jobs and b) people pay attention to CI job failures. The instability is everywhere in the Cloud VPS network, in my experience and understanding.
[18:07:08] bd808: if we had instability "everywhere in the network", surely we could reproduce it?
[18:12:31] That is typically what the network engineers say, yes. As you implied before with the comment about making applications more tolerant of failures, we hide a lot of the pain by adding more layers that work around it. I put 6 weeks of work into wikibugs and made it recover better, etc.
[18:17:05] Being pragmatic - there is a missing piece here
[18:18:15] On the face of it, from the results of the tests I ran, the error rate is non-existent, or possibly in the order of 1 out of a million requests or so
[18:18:35] If that rate of error was truly causing major problems, there would be a case for the application to be more tolerant
[18:19:14] I don't think git, or Linux, or whatever the HTTPS client is, are going to completely fall over with such a low incidence of errors though
[18:20:22] They are either seeing more errors than we've found in the synthetic tests we've run, or there is something else causing them to fail to make those connections
[18:20:37] Either way we need to find the pattern of what is failing, from what devices, at what times
[18:20:58] in an effort to reproduce the issue, and then be able to look at it occurring and troubleshoot
[18:24:19] if the message is "there is instability everywhere in the cloud network", then let's try to find a way to demonstrate that and start digging into the problems
[18:24:42] on the face of it I don't know what to check next, cos everything we've been asked to look at so far is testing clean
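
The "better error reporting" bd808 mentions, aimed at surfacing the pattern topranks asks for (what is failing, from what devices, at what times), might look roughly like the sketch below. It is illustrative only: it assumes the Python requests library, and the log path and target URLs are placeholders rather than anything gitlab-account-approval actually does.

    # Sketch of structured failure logging for outbound HTTPS calls: classify
    # each failure (connection/DNS, timeout, HTTP error) and record the URL
    # and time so patterns across hosts and time windows can be spotted later.
    # Assumes the requests library; the log path and targets are placeholders.
    import datetime
    import json
    import requests

    LOG_PATH = "/tmp/network-failures.jsonl"  # placeholder

    def checked_get(url, timeout=10):
        record = {"time": datetime.datetime.utcnow().isoformat(), "url": url}
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.exceptions.ConnectionError as exc:
            # DNS resolution failures also surface here; keep the message for triage.
            record["error"] = f"connection: {exc}"
        except requests.exceptions.Timeout:
            record["error"] = "timeout"
        except requests.exceptions.HTTPError as exc:
            record["error"] = f"http: {exc}"
        with open(LOG_PATH, "a") as fh:
            fh.write(json.dumps(record) + "\n")
        return None

    # Example: probe the internal services mentioned in the discussion.
    for target in ("https://gitlab.wikimedia.org", "https://phabricator.wikimedia.org"):
        checked_get(target)

Each failure then becomes one JSON line with a timestamp, URL, and error class, which is about the minimum needed to start correlating failures across hosts and time windows.
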