[07:22:29] !log rolling-restart of swift frontends
[07:22:30] Emperor: Not expecting to hear !log here
[07:23:03] ah, the cookbook logs it anyway
[08:09:59] So we had another spike in errors between ATS and swift - if you look at https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-site=All you can see a brief rise in 504s around 16:55 UTC yesterday, followed by a sustained rate of about 4-5 502/s until 07:25 UTC this morning, when I did the rolling-restart
[08:13:23] you can see it in turnilo https://w.wiki/6nMd (though the 504 spike is too small to see)
[08:14:51] likewise superset https://superset.wikimedia.org/superset/dashboard/p/pZxOpw8rVm0/
[08:15:54] But swift doesn't really log anything interesting (or at least nothing I've found yet); it doesn't seem to be a problem with tempauth (since we now graph that)
[08:16:38] and traffic doesn't _look_ unusual at the point things start going wrong (although I don't know what I should be looking for)
[08:17:45] and I suspect the errors are ATS getting bored of waiting for swift (rather than swift itself being sad)
[08:18:32] but I don't have good evidence for that (as opposed to a lack of evidence of e.g. backtraces in swift logs), nor good ways of looking for it. And "restart everything" fixes things, but is unsatisfactory.
[08:19:34] [the one thing we don't have is any sort of visibility into the state of memcached]
[08:19:46] s/the one/another possibly important/
[08:30:52] I suppose it's also worth noting that it's only a tiny fraction of requests that are getting these error responses (so it's not like swift has collapsed or anything); but we do at least sometimes get user complaints that uploading isn't working for them
[08:41:59] [one of the object servers was saying ECONNREFUSED to swift-recon this morning, but restarting cleared that, and that was a backend]
[08:48:58] The fact it starts in both DCs almost identically makes me suspect write traffic
[08:50:13] ...but there's nothing obviously unusual in the graphs
[13:22:57] Emperor: write traffic? as in the rate of?
[13:25:45] I doubt the rate thereof (we have graphs of that, nothing notable), but maybe something unusual (large size???)
[13:30:30] There's nothing obvious on the memcached dashboard, though https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad+prometheus%2Fops&var-cluster=swift&var-instance=All&from=1685539810710&to=1685712610710
[13:32:04] I mean other than that ms-fe1009 seems to be doing more memcached work than any other node by a factor of 4-5... (!?!), but that seems unrelated to our incident
[13:33:12] wouldn't large size increase transfer rate though?
[13:33:36] similarly in codfw, ms-fe2011 is doing a load more work than the other frontends
[13:34:32] urandom: perhaps. But maybe only a few large objects would be lost in the noise of a lot of smaller ones
[13:35:43] it's weird that both clusters have 1 node that seems to be doing lots more memcached work than the others
[13:36:04] ...but maybe this is some nuance of the swift/memcached interaction I don't understand
[13:39:28] you're referring to this, I assume? https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=swift&var-instance=All&from=1685539810710&to=1685712610710&viewPanel=42
[13:39:39] i.e. "gets"
[13:40:07] that's... a lot more than 4-5x, no?
[13:41:32] well, the bytes read and bytes written panels, but the pattern's the same
[13:41:57] yes, you're right that it is quite a lot more.
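For reference on the "nuance of the swift/memcached interaction" question above: as I understand it, Swift's proxy-side memcache client hashes each cache key onto a ring of memcached servers, so a single hot key (say, a very busy account/container cache entry or auth token) is always answered by the same instance, which could produce exactly this kind of one-node-hotter-than-the-rest pattern. The sketch below is a generic consistent-hashing toy, not Swift's actual code; the host names and keys are made up.

```python
# Toy illustration of client-side key->server hashing (an assumption about the
# mechanism, not a reading of Swift's memcached client or the deployed config).
import hashlib
from bisect import bisect
from collections import Counter

SERVERS = ["ms-fe1009", "ms-fe1010", "ms-fe1011", "ms-fe1012"]  # illustrative hosts

# Place each server at several points on a hash ring.
ring = sorted(
    (int(hashlib.md5(f"{server}-{i}".encode()).hexdigest(), 16), server)
    for server in SERVERS
    for i in range(100)
)
points = [p for p, _ in ring]

def server_for(key: str) -> str:
    """Hash the key and walk clockwise to the next server point on the ring."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return ring[bisect(points, h) % len(ring)][1]

# The same hot key always lands on the same instance...
print(server_for("account/AUTH_mw"))  # hypothetical hot key
# ...while many distinct keys spread roughly evenly across the frontends.
print(Counter(server_for(f"container/AUTH_mw/c{i}") for i in range(10_000)))
```

If that mechanism is right, one disproportionately popular cache key (rather than uneven client traffic) would be enough to explain a single memcached instance doing several times the "gets" of its peers in each DC.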
[13:43:34] Emperor- could it be my backups?
[13:44:09] I can pause them to test
[13:44:15] jynus: dunno, do you only target one frontend in each DC?
[13:44:34] I believe so
[13:44:38] [you'd have to be trying to do so, I think, rather than using the discovery record]
[13:45:10] but 3K ops/s - I don't think that's my rate
[13:45:22] I doubt I do more than 1 op/s
[13:45:32] https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-site=All&from=1685540716967&to=1685713456967&viewPanel=34 <-- suggests query rate per server is similar
[13:45:49] but let me pause for some minutes on one DC?
[13:45:51] ok?
[13:45:59] So there is something else going on here that's not just query rate
[13:46:11] jynus: sure, if that's easy we can at least rule that out :)
[13:46:21] eqiad or codfw?
[13:47:37] codfw?
[13:47:44] ok, doing
[13:47:48] the memcache graphs are confusing: the rate of set and incr don't stand out for fe1009, just get, but both bytes read *and* written stand out
[13:47:57] mmm
[13:48:50] jynus: thanks; if it's OK to leave off for 5m or so that should give us a clear window to look at
[13:48:52] I have stopped the downloading of swift originals now
[13:48:57] yes it is
[13:49:09] wouldn't you see a corresponding increase in write operations if write bytes were that much of an outlier?
[13:49:20] I am keeping the metadata update but that only queries mysql, not swift
[13:50:52] urandom: yeah, I think there is no evidence to support my "uploading some large objects made swift unhappy" theory. But then, whilst this memcached unevenness in loading is weird, there's nothing that obviously correlates with the rise in 5xxs we saw either :(
[13:52:44] ya :/
[13:53:51] (IWBNI there was an openstack swift forum where one could say "is it just me, or does everyone's cluster do that?")
[13:57:48] there's an IRC channel but it's pretty quiet except for bot traffic
[13:59:54] Emperor: any change in some metric?
[14:02:40] I think I should only be doing 1 GET at upload rate on average (as expected)
[14:03:37] jynus: no, nothing at all (which doesn't surprise me, but it's useful to have checked), feel free to restart
[14:04:07] I prefer to be safe, I am 99% confident of the logic, but you never know!
[14:19:04] restarted - there may be a small spike to catch up, but now back to normal download rate
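On the "no visibility into the state of memcached" point raised earlier: a minimal sketch of one way to get it, by polling the standard memcached text-protocol "stats" command on each frontend and keeping a few counters (gets, hits/misses, bytes in/out, connections). The host names and the default port 11211 are assumptions here, not a statement about the deployed setup, and in practice you would want these counters in Prometheus rather than an ad-hoc script.

```python
# Sketch: dump selected memcached counters from each swift frontend.
import socket

HOSTS = ["ms-fe1009.eqiad.wmnet", "ms-fe2011.codfw.wmnet"]  # illustrative hosts
PORT = 11211  # memcached's default port; assumed, not confirmed
WANTED = {"cmd_get", "get_hits", "get_misses", "bytes_read", "bytes_written", "curr_connections"}

def memcached_stats(host: str, port: int = PORT, timeout: float = 2.0) -> dict:
    """Send 'stats' and parse the 'STAT <name> <value>' lines up to the 'END' terminator."""
    buf = b""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    stats = {}
    for line in buf.decode().splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "STAT" and parts[1] in WANTED:
            stats[parts[1]] = parts[2]
    return stats

if __name__ == "__main__":
    for host in HOSTS:
        try:
            print(host, memcached_stats(host))
        except OSError as exc:  # e.g. ECONNREFUSED, timeouts
            print(host, "unreachable:", exc)
```

Sampled a few minutes apart, the deltas in cmd_get and bytes_read/bytes_written per host would show whether the skew toward ms-fe1009 / ms-fe2011 is steady or tracks the 502/504 episodes.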