[09:53:08] jbond: do you have a minute to look at https://puppet-compiler.wmflabs.org/pcc-worker1002/37956/ - PCC fails on two of two nodes but the run passes with "SUCCESS (NOOP 2)" (cc klausman)
[09:54:44] jayme: there is currently a bug where pcc puts hosts in the wrong section. I hope to fix this in the next release
[09:54:54] for this specific issue the following host actually has an error
[09:54:55] https://puppet-compiler.wmflabs.org/pcc-worker1002/37956/ml-staging-ctrl2002.codfw.wmnet/prod.ml-staging-ctrl2002.codfw.wmnet.err
[09:56:23] ah, okay (re: wrong section) - that's the error we spotted, but we have not yet figured out why it failed. the corresponding labs/private change has been merged already..
[09:57:01] or is it because the previous state fails to build? that would make sense indeed
[09:58:52] jayme: ignore what I said - the error is in the production run. let me give it a test with debug and also with my fix branch
[09:59:43] jbond: production is "pre change" state, right?
[10:00:08] yes
[10:00:19] then it would make sense to fail
[10:00:28] https://gerrit.wikimedia.org/r/c/labs/private/+/852196 <- I git mv'ed the files
[10:00:31] Ooh, so the problem is that with the private change, the old state is bork
[10:00:52] So it can't diff, because it can't look at what was
[10:00:56] yes
[10:01:01] Hrm. The joys of unsynced repos :)
[10:01:09] ahh so yes, this is something that I find a little confusing and would be happy to change, but the issue is that
[10:01:37] currently if things are broken in HEAD but fixed in your patch then hosts get placed in the no changes section
[10:01:52] specifically notice the bits in brackets "or compile correctly only with the change"
[10:02:54] I presume there is no here-and-now fix for this, and we'll just hope that our no-op change is actually no-op.
[10:03:24] like you said, if it's currently broken then there is no way for pcc to get a starting catalog to use in the diff
[10:04:05] other than manually checking the catalog I can't really think of anything
[10:04:17] it's not currently broken. My labs/private change broke it because with git mv'ing the files the production state can no longer be rendered
[10:04:21] I guess copying the files (instead of moving them) in the private/labs repo, PCCing, merging and then making a delete change would be the way to do this
[10:04:27] ack
[10:04:35] +1
[10:04:48] if you leave the files in place and remove them after the change is merged
[10:04:51] although I would keep the git mv (for git history) and copy to the old place
[10:05:29] which then would be removed after a successful pcc
[10:06:00] ack
[10:06:40] No easy way back there now. Should we just merge 852158 and proceed?
[10:07:41] I have the actual-private change stashed on the pm
[10:08:46] as this is staging I guess you can proceed. Or, if you want a successful pcc, add another change to labs/private that copies the moved files to the old location
[10:09:16] Eh, I am (over)confident enough to give the direct route a try
[10:09:27] +1
[10:09:55] last change to stop me from submitting 852158
[10:09:58] chance*
[10:12:27] All merged, running puppet agent manually on ml-staging-ctrl2001.codfw.wmnet
[10:13:29] ...and of course it fails.
[10:13:36] (same token error)
[10:15:00] ah, found it. Private change was incomplete
[10:18:32] master applies cleanly, now testing worker
[10:20:15] and worker is fine too
[10:23:02] cool. Thanks!
[10:23:07] jayme: I think this is all done now.
[10:25:33] apergos: do we have dumps running now so I can check if it's fixed or not?
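
A minimal sketch of the copy-first workflow agreed above, so that PCC can still render the pre-change (production) catalog: keep a copy of each secret at its old path in labs/private, run PCC, merge, and only then remove the old copies in a follow-up change. The paths below are placeholders, not the real labs/private layout:

    # change 1: git mv for history, but copy back so the old path still resolves
    git mv hieradata/old/secrets.yaml hieradata/new/secrets.yaml
    cp hieradata/new/secrets.yaml hieradata/old/secrets.yaml
    git add hieradata/old/secrets.yaml
    git commit -m "Move secrets; keep a copy at the old path so PCC can render prod"
    # run PCC against this change, merge it, then...

    # change 2: drop the old copies once PCC is green
    git rm hieradata/old/secrets.yaml
    git commit -m "Remove old copies now that the move is merged"
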
[10:25:35] go
[10:25:45] * jbond wrong window
[10:26:34] they started running again yesterday evening, Amir1, and they seem to be running normally, but I did not check the conf boxes to see what's happening
[10:26:35] * Amir1 hopes jbond runs git reset --hard in the wrong window
[10:27:31] apergos: seems okay https://grafana.wikimedia.org/d/tTE9nvdMk/etcd?orgId=1&from=now-24h&to=now&viewPanel=4
[10:27:49] :)
[10:28:04] good to know
[10:28:24] let me know if any of them fails because of the other bug T322156. I think that would fix this one as well due to the 1s TTL of loadmonitor
[10:28:24] T322156: New errors during this month's full dump run: LoadBalancer.php: No server with index '4' - https://phabricator.wikimedia.org/T322156
[10:28:53] I'll be updating the task as the run continues, yep
[10:29:51] you saw Aaron's comment on the patch for that I guess
[10:30:14] yeah, I'm gonna respond there soon
[10:31:47] I'll be following along
[10:35:04] apergos: can I quote you for T322360, or can you add a similar comment to that ticket - I think it is important
[10:35:04] T322360: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360
[10:35:49] quote me as saying what? I can comment there, just tell me which thing I said you want me to say again ;-D
[10:36:11] the state of dumps running again
[10:36:15] sure
[10:36:27] thank you!
[10:43:23] I scribbled something :-)
[10:45:37] thank you, and one last question
[10:45:43] if I may dare
[10:46:08] the reason the logs started at 08:08 was because some batch process started hitting that codebase there, right?
[10:46:20] systemd timer
[10:46:36] that is for the start of backups, the start of something particular?
[10:46:41] sorry
[10:46:43] *dumps
[10:46:47] (my fault)
[10:47:15] I want to reflect accurately what was the trigger (even if not the cause)
[10:48:17] for the start of the dumps; the systemd unit starts a script that checks to see if the dumps are running, and if not, it will restart them, as mentioned on the task. this happens twice a day, exactly so that if we wind up shooting something or some job dies, it can just pick back up once the problem is fixed, with no further intervention needed
[10:48:45] thanks, that's a great explanation
[10:48:54] will add it to the doc
[10:49:02] sure thing!
[10:50:16] as this justifies a gap in our monitoring, as the high etcd load was there for many hours
[10:50:53] and probably we should have caught this earlier, as elukey mentioned on the ticket
[10:51:17] yes, something like that probably should have been caught sooner, if only on the etcd end of things
[10:53:08] indeed
[11:06:13] I created T322400, as I think that is the big follow-up
[11:06:14] T322400: Monitor high load on etcd/conf* hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400
[11:50:03] that kind of client should be ratelimited by etcd
[14:41:36] let's discuss here I would say
[14:41:42] +1
[14:41:44] denisse|m: --^
[14:41:53] ack. :)
[14:42:05] ok
[14:42:06] sure
[14:42:32] denisse|m: are you doing IC?
[14:42:36] I'll acknowledge alerts
[14:42:40] proxy-server logging a lot of Handoff requested (27) (txn: tx21152036c4fb4551b9371-006365234e)
[14:42:46] Here's the link to the current document: https://docs.google.com/document/d/12QY-N1oXRwY4tPHO0fwrvf2osvZnr-2Vjfl_3pAOjE4/edit?usp=sharing
[14:42:54] bblack: Yes, I'm the IC.
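
The dumps auto-restart described at 10:48 (a systemd timer fires twice a day and restarts the dumps only if they are not already running) boils down to something like the sketch below; the script name, process pattern and unit name are illustrative guesses, not the actual dumps tooling:

    #!/bin/bash
    # hypothetical check-and-restart script run twice a day from a systemd timer
    if ! pgrep -f 'worker.py.*dumps' >/dev/null; then
        echo "dumps not running, restarting them" | systemd-cat -t dumps-restarter
        systemctl start dumps-rerun.service   # placeholder unit name
    fi
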
[14:43:07] splunk pages ACKed
[14:43:22] And here's the doc from the similar incident a few days ago, for context: https://docs.google.com/document/d/1gAJhqBnDCQK6bLz61w1Mr9z2wYiSwMRuFkU3EYQgNyw/edit#
[14:43:37] which means the proxy-server (thinks) it couldn't get an answer from the primary backend so tried a replica
[14:43:43] reads are ramping up in codfw
[14:43:46] Emperor: if swift-eqiad is depooled, is there any value in trying to restart a couple of swift-proxy services to see if they recover? (to validate the deadlock thesis)
[14:43:54] looks like we also did a rolling restart of ms-fe frontends last time
[14:43:55] (TTL is set to 300s for swift.discovery.wmnet IIRC)
[14:44:00] (in eqiad)
[14:44:01] I shared the link to the template, my bad.
[14:44:13] Here's the link to the document: https://docs.google.com/document/d/1Gd98aR28A4dw6dsXf0lMpHOWZoBwEzgTieR_5m7NSuk/edit#
[14:44:26] restarted on fe1010
[14:44:29] herron jelto jynus ^^
[14:44:38] I am getting around 80% high latency or failures on thumbs at https://commons.wikimedia.org/w/index.php?title=Special:NewFiles&offset=&limit=500
[14:45:46] does docker-registry also use swift underneath?
[14:45:53] IIRC yep
[14:46:03] yeah the docker images are stored on swift
[14:46:16] ah, and in codfw only apparently, according to some random puppet comment
[14:46:22] bblack: yes
[14:46:36] still, it's also alerting
[14:46:45] Emperor: on 1010 I don't see any 503s after the restart
[14:47:04] docker-registry is active/passive so I guess it's just the checks failing in eqiad
[14:47:09] maybe we can roll-restart, leaving only one swift proxy alone for debugging?
[14:47:28] we have some CRITs in icinga still on ms-feNNNN for: 1011, 1012, 2011, 2012
[14:47:29] only 2 files uploaded to commons since 14:29
[14:47:33] elukey: there are 3, you want to restart 1010 and we'll leave 1012 for now?
[14:47:57] Emperor: 1011 right? I can take care of it
[14:48:04] and 1012 should be depooled
[14:48:04] * btullis here in case I can help with anything
[14:48:09] elukey: yes, sorry
[14:48:16] * Emperor looking in the 1010 logs
[14:48:28] restarted swift-proxy on 1011
[14:48:54] shall I depool 1012?
[14:49:14] I'd say anything we're not planning to restart should depool
[14:49:15] err... swift@codfw is also strugging
[14:49:24] *struggling
[14:49:27] lovely
[14:49:33] on both sides
[14:49:45] 20221104.14h48m57s CONNECT: Broken pipe [32] connecting to 10.2.1.27:443 for host='upload.wikimedia.org' url='https://swift.discovery.wmnet[....]
[14:49:51] that's from cp3051
[14:50:02] and 10.2.1.27 is ms-fe.svc.codfw.wmnet.
[14:50:18] if it's load-related then depooling half the frontends is going to risk pushing the other half over
[14:50:23] see ms-fe2* on fire in -operations
[14:50:31] we've only restarted 1010 and 1011, and depooled 1012, so far?
[14:50:48] but we depooled eqiad on the discovery DNS record
[14:50:55] I think someone depooled all of eqiad; what vgutierrez said
[14:50:56] Emperor: should we repool the eqiad side?
[14:51:09] I'm inclined to think: restart 1012, repool eqiad to try and spare codfw
[14:51:21] ok, I'm ready to repool eqiad
[14:51:24] let me know when, Emperor
[14:51:47] if so let's either depool 1012 or restart swift-proxy on it (if not already done)
[14:51:49] restarted 1012
[14:51:52] ack
[14:51:57] vgutierrez: I think repool eqiad now
[14:52:05] done
[14:52:18] (still trying to get to the right bit of logs on 1010)
[14:52:40] [future followup: can we set swift and/or other disc records TTLs lower?]
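
For reference, the per-host restarts being done above can be wrapped so each frontend is depooled around the restart and its logs checked before repooling; `depool`/`pool` are the conftool wrappers quoted later in this log, and the grep pattern is only illustrative:

    # on a single ms-fe host, e.g. ms-fe1011
    sudo depool
    sudo systemctl restart swift-proxy
    sleep 30
    # look for fresh 503s or memcached timeouts before putting it back
    sudo journalctl -u swift-proxy --since '10 min ago' \
      | grep -cE ' 503 |Timeout getting a connection to memcached'
    sudo pool
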
[14:52:51] people are filing https://phabricator.wikimedia.org/T322417 - I am going to add a status page incident
[14:53:14] so around 14:28:16 I start seeing Nov 4 14:28:16 ms-fe1010 proxy-server: Timeout getting a connection to memcached: 10.64.16.92:11211 (txn: tx6ade2d9f5c83404fafe34-006365217f)
[14:53:29] jynus: 👍
[14:53:44] godog: you around?
[14:54:08] Emperor: yeah, that's just before the first probe alerts, so makes sense as a breadcrumb for getting at the root of this
[14:54:28] thanks jynus, that should have been done much earlier
[14:55:30] also possibly relevant: Nov 4 14:27:00 ms-fe1010 proxy-server: Error limiting server 10.64.16.92:11211 (txn: tx198847ad2e794f7982e02-0063652133)
[14:55:41] Emperor: yes, reading backscroll
[14:56:34] [we always seem to have a background level of Handoffs and ERROR with Objectserver ConnectionTimeout]
[14:56:47] * akosiaris acking new pages
[14:57:05] as things get worse, we start seeing Nov 4 14:28:43 ms-fe1010 proxy-server: ERROR with Account server 10.64.48.33:6002/sda3 re: Trying to HEAD /v1/AUTH_mw: ConnectionTimeout (0.5s) (txn: txdb120fe9de94496b974e1-0063652197)
[14:57:44] godog: I think so
[14:57:46] then the log-spam of Timeout getting a connection to memcached and Error limiting server really gets going
[14:57:55] we saw the same errors about AUTH_mw this weekend
[14:58:14] godog: I've repooled eqiad ~5 minutes ago.. so it should be effective ~now
[14:58:23] ack, thanks vgutierrez Amir1
[14:59:07] but yeah the error limiting logs would suggest to me either swift gets confused talking to memcache or memcache itself is wedged somehow, my hunch is on the former
[14:59:11] errors are still high on the ats side: https://grafana.wikimedia.org/d/000000479/cdn-frontend-traffic?orgId=1&viewPanel=14&var-site=All&var-cache_type=upload&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&from=1667570336897&to=1667573936897
[14:59:22] I see pybal errors for codfw frontend nodes, should we roll restart swift proxies in there too?
[14:59:36] probably?
[14:59:49] can't hurt I'd say, if restarting fixed eqiad
[14:59:58] yeah I think so, codfw still seems to be throwing a lot of errors
[15:00:02] ok doing it
[15:00:06] all three ms-fe in codfw are still alerting in icinga too
[15:00:08] ack
[15:00:48] godog: is memcached on the proxies a standard swift thing, or WMF-specific?
[15:00:49] `elukey@cumin1001:~$ sudo cumin 'ms-fe2*' 'systemctl restart swift-proxy' -b 1 -s 20`
[15:01:28] Emperor: standard in the sense that swift depends on memcached, if I understand the question correctly
[15:01:50] I see recoveries for ms-fe2*
[15:01:51] and does it have its own memcached cluster? if it's shared, any issue with memcached itself would bring down mw altogether
[15:02:08] (so it probably means something on the swift side)
[15:02:14] yeah memcached is its own cluster, running on the ms-fe hosts
[15:02:24] ok I see
[15:02:29] Thanks
[15:02:32] codfw fe restarted
[15:02:35] ok... https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=upload&var-origin=swift.discovery.wmnet&from=now-3h&to=now&viewPanel=12
[15:02:38] that's looking better
[15:02:50] sure, two independent clusters to be exact, one per eqiad/codfw
[15:03:04] Special:NewFiles looking good now
[15:03:22] ok to switch status to "monitoring"?
[15:03:23] not perfect.. but getting there :)
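
The breadcrumbs quoted above (memcached connection timeouts, then "Error limiting server", then account-server HEAD timeouts) can be counted per minute to see exactly when the spiral starts; a rough one-liner, assuming the proxy logs land in /var/log/syslog as in the excerpts:

    # per-minute counts of the memcached-related errors on an ms-fe host
    grep -hE 'Timeout getting a connection to memcached|Error limiting server' /var/log/syslog \
      | awk '{print $1, $2, substr($3, 1, 5)}' | sort | uniq -c | tail -30
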
[15:03:32] I think the incident is stopped now.
[15:03:59] codfw swift cluster is still serving ~1.5krps of errors, btw
[15:04:03] [another future followup: there seemed to be questions about all swift load on one side: are we sure swift has capacity at current load in general for core DC redundancy?]
[15:04:16] one thing that caught my eye is https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=swift&var-instance=All&from=now-3h&to=now&viewPanel=55, ms-be1055 seemed to have more rx bw usage than the rest
[15:04:46] also commons uploads happening at a good rate
[15:05:31] (when y'all are out of incident, should I bin T322417 and/or for the future not log? Only did so in case the first timestamp gave a good place to start looking for the root cause)
[15:05:33] T322417: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T322417
[15:05:36] upload is still over 500 errors per second globally
[15:05:57] vgutierrez: is that on codfw? both?
[15:06:15] is error noise like `journalctl -fu swift-proxy | grep ERROR` normal? e.g. ERROR with Object server 10.64.32.64:6024/sdq1 re: Trying to GET
[15:06:17] jynus: eqiad, codfw, esams, ulsfo, eqsin and drmrs
[15:06:25] :-(
[15:06:31] some of which use each side of swift
[15:06:52] (ulsfo+eqsin+codfw -> codfw, eqiad+esams+drmrs -> eqiad, when both are pooled)
[15:06:58] indeed
[15:07:22] godog: dumb swift question: what does a `HEAD /v1/AUTH_mw` call actually do?
[15:08:04] picking a backend at random (since the proxy-server logged a timeout) ms-be1062 is under some load but not absurdly so; the container-server is using half a CPU and some of the object-servers are in disk-wait (load ~20)
[15:08:20] per https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&var-DC=codfw&var-prometheus=codfw%20prometheus%2Fops
[15:08:27] it looks like codfw is the one struggling right now
[15:08:40] it's slowly recovering, I think
[15:08:50] it looks like that
[15:08:54] cdanis: mmhh I can't remember off the top of my head, is it happening often? could be stats gathering if it isn't often
[15:09:28] https://wiki.openstack.org/wiki/Swift/MemcachedConnPool is a bit concerning, but not desperately enlightening
[15:09:35] godog: it's appeared in the logs early both times the cluster has fallen over, and the vast majority of the error codes on the swift dashboard are HEAD calls
[15:10:14] so as far as I understand, things are stabler, but I am waiting until there is full consensus that we are almost as good as before to update the status
[15:10:32] CDN is happy now
[15:10:40] error rate back to normal
[15:11:02] ok, cdanis, from your view? as you were the other person who also saw ongoing errors
[15:11:22] cdanis: *nod* yeah I see it now on the dashboards, my guess would be clients e.g. mw trying to re-authenticate
[15:11:46] yeah I'm wondering about symptom vs cause godog, like if we unwittingly create a very hot key once failures start
[15:11:50] is it possible there's been a change in the MW (or other clients') behavior that's driving this new failure pattern?
[15:11:51] BTW... as a follow-up... "Errors: writes" panel seems broken in https://grafana.wikimedia.org/d/000000584/swift-4gs
[15:12:23] vgutierrez: Thanks, creating a ticket for that.
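
On the `HEAD /v1/AUTH_mw` question above: in the Swift API a HEAD on the account path returns the account's metadata (container/object/byte counts), and it is roughly what `swift stat` issues with no arguments; the guess in the chat is that MW hits it when (re)validating auth. A hedged curl sketch, where the token and endpoint are placeholders:

    # $TOKEN and the endpoint are placeholders; this mirrors what `swift stat` sends
    curl -sI -H "X-Auth-Token: $TOKEN" https://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw
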
[15:12:30] denisse|m: thx <3
[15:12:42] cdanis: good point, yeah it's quite possible things start snowballing
[15:12:58] I don't really know anything though, I'm just making wild guesses
[15:13:12] ok, switching to monitoring, as I think that best describes the current status
[15:13:35] (we did reduce the number of backends, but that was a while back now - think about Prague time)
[15:13:47] that's not that long ago
[15:14:05] it's the frontends that seem like the problem, though
[15:18:02] vgutierrez: Here's the link for the broken errors panel, I'll take a look at it after my on-call shift: https://phabricator.wikimedia.org/T322418
[15:18:15] denisse|m: cheers, I got an email as well :)
[15:18:22] denisse|m: ok to resolve on status page?
[15:18:42] jynus: Yes, I think the issue is resolved now.
[15:19:00] If swift-proxy starts getting timeouts connecting to memcached it'll start ignoring the "offending" memcached server. I wonder if that means that once load gets high enough you suddenly start making things worse by having swift-proxy start giving up on some of the memcacheds
[15:19:25] Emperor: and the memcacheds are running on the same machines as the swift-proxies, right?
[15:19:32] yes
[15:19:44] Emperor: just to be clear, you are now in "root cause analysis" mode, not on immediate mitigation, right?
[15:19:52] jynus: t
[15:20:28] so I'm wondering if some combination of increasing the timeouts and/or increasing the allowed errors might reduce that effect
[15:20:49] I think we're legit just underprovisioned on swift frontends
[15:21:02] https://i.imgur.com/dvvgaBf.png
[15:21:38] https://i.imgur.com/QUDpjqP.png
[15:21:41] it's the ms-be servers that are making all the connections to the memcacheds on the ms-fe nodes
[15:21:49] [new info to me anyways, maybe was obvious!]
[15:21:58] ah thanks bblack
[15:22:05] bblack: these errors connecting to memcached are coming from the fe nodes
[15:22:23] I was just looking at the ESTABLISHED connections to ms-fe1010's memcached, from lsof
[15:22:38] but yeah, there are fe nodes in the mix too, you're right
[15:22:40] ummm
[15:22:41] and localhost
[15:22:46] https://i.imgur.com/CLDJzas.png
[15:23:00] there is *definitely* some sort of positive feedback loop in here around retries-during-failure
[15:23:13] looks like all the ms-fe in the dc, all the ms-be in the dc, and localhost
[15:23:18] I think we have an actionable for ATS as well here. ATS flags the origin server as down and remembers that for 60 seconds... and that's definitely not nice for a service that's behind a VIP
[15:23:35] same pattern in eqiad https://i.imgur.com/Sp3Q1Gc.png
[15:23:39] vgutierrez: yeah that's a good followup, should be tunable and this is probably broadly applicable
[15:23:46] cdanis: that might be swift-proxy giving up on the local memcached and suddenly trying to talk to the other frontends' ones?
[15:24:49] underprovisioned> at the risk of buggering my desire to make a Ceph cluster on MOSS, we do have moss-fe{1,2}00{1,2} that could be stolen and turned into 2 extra swift frontends per DC if cdanis is correct
[15:25:23] is there any reason why swift FEs can't just be ganeti VMs?
[15:25:24] but I'm wondering if we should make swift wait longer and be more relaxed about memcached timeouts first. Thoughts, godog, who knows swift well?
[15:25:39] cdanis: I don't see why not
[15:25:59] what is swift using mc for? indices on which be nodes hold which objects?
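
A hedged way to reproduce the lsof observation above (who is holding connections to a frontend's memcached), grouping established connections to port 11211 by peer address; run on an ms-fe host:

    # ESTABLISHED connections to the local memcached, grouped by peer host
    sudo ss -Htn state established '( sport = :11211 )' \
      | awk '{print $4}' | sed 's/:[0-9]*$//' | sort | uniq -c | sort -rn | head
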
[15:26:44] it seems odd to me that memcached and swift-proxy are colocated, but evidently that is the recommended approach
[15:26:56] cdanis: a ton of bandwidth I'd say
[15:27:01] XioNoX: topranks: what happened to https://turnilo.wikimedia.org/#network_flows_internal ?
[15:27:09] with your permission, I think Swift should be the topic for Monday's postmortem, unless you tell me you need more time
[15:27:15] re: talking to other memcached, all ms-fe talk to all memcacheds in a consistent hash ring
[15:27:26] bblack: auth sessions and rate limits IIRC
[15:27:30] ok
[15:27:40] T322420 created
[15:27:41] T322420: ATS flags origin servers as down during 60 seconds after a connect timeout - https://phabricator.wikimedia.org/T322420
[15:28:05] so that explains the connection between the MW auth failures + memc issues, although cause:effect is still not clear to me
[15:28:32] but yeah re: underprovisioning, moving moss-fe to ms-fe would be a good bandaid I'd say, specs are the same IIRC
[15:28:44] cdanis: both of us are off today, is there an ongoing issue you need it for?
[15:28:44] sorry, dumb question, I know this was answered before -- why are ms-fe1009 and ms-fe2009 depooled?
[15:28:58] stretch hosts
[15:29:00] they were depooled last time because they're the only ones left running Debian stretch I think
[15:29:00] cdanis: they're still running stretch, because swiftrepl is stretch-only
[15:29:11] 😬
[15:29:24] ...I've been yak-shaving about trying to get our rclone-based replacement for swiftrepl ready to deploy
[15:29:28] cdanis: I'm on vacation but maybe the answer is https://phabricator.wikimedia.org/T308778#8119297
[15:29:32] cdanis: That network_flows_internal might be on me, I think. T308778
[15:29:32] T308778: Fix turnilo after upgrade - https://phabricator.wikimedia.org/T308778
[15:29:36] Might be related to renumbering of netflow1002 last week although I thought everything was back working
[15:29:41] ah thanks XioNoX btullis
[15:29:57] ah
[15:30:06] okay!
[15:30:33] https://i.imgur.com/V6hWGdi.png taking ms-fe1009 out of service seems to have increased CPU load on the others to a dangerous level
[15:30:33] yak> https://salsa.debian.org/debian/pristine-tar/-/merge_requests/9 (needed so it's possible to try and build rclone 1.60, which contains the fixes we need)
[15:30:37] cdanis: I’m away from my desk but if there is an emergency feel free to call
[15:31:23] How hard is renaming nodes? Or, to put it another way, how much faff would it be to turn the four moss front-ends into ms-feXXXX?
[15:31:30] Emperor: re: timeouts I'm not sure tbh
[15:32:17] cdanis: interesting - we removed the stretch nodes because we thought they might be the cause of the issue, but perhaps their removal is making it more likely that we enter the death spiral
[15:32:30] yeah...
[15:32:37] I don't know either, and we did enter the spiral once beforehand
[15:32:37] we don't generally rename nodes, it makes tracking history hard
[15:32:54] but you could re-role them on their existing names, I would assume
[15:33:06] bblack: I'm a bit wary of repurposing them as swift frontends without renaming, though, seems likely a bit of a footgun for the future?
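
On "what is swift using mc for": the answer above is auth sessions and rate limits; the memcached text protocol can also be poked directly to sanity-check hit rates and evictions. `stats` is a standard memcached command; the host is illustrative and the nc flags may need adjusting for the local netcat variant:

    # basic usage numbers from the swift memcached on one frontend
    echo stats | nc -q 1 ms-fe1010.eqiad.wmnet 11211 \
      | grep -E 'curr_connections|get_hits|get_misses|evictions'
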
generally moss-* aren't in production
[15:33:09] I donno, we've made exceptions before
[15:33:21] re-role> surely
[15:34:00] they'll need a reimage (currently running buster), but that ought to be straightforward
[15:34:43] (does rather bugger up my MOSS plans, but I think those were getting punted to next quarter anyway)
[15:35:20] any chance we have other unspoken-for hardware that could be used?
[15:35:26] I'd defer to dcops on the rename thing, I think they're the main ones with a stake in that debate
[15:35:29] we used to keep spares, but I don't believe that we do anymore
[15:35:32] I'm beginning to believe that maybe the simpler thing is just understanding if swift failing to use memcached is the issue
[15:35:56] (re: tracking warranties and hwfail and other dcops tickets by hostname, although we do have asset tags and netbox records that would persist, I think)
[15:36:37] There is some wikitech documentation on renaming here: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging I've never done it, but it looks comprehensive.
[15:36:56] cdanis: naming notwithstanding, making new swift proxies is a fairly Simple Matter of Puppet
[15:37:04] :)
[15:37:12] we even have docs! https://wikitech.wikimedia.org/wiki/Swift/How_To#Add_a_proxy_node_to_the_cluster
[15:37:55] it does seem like a rename from one unique name to another new one wouldn't be that bad. The more egregious case is when we re-use old names for new hardware.
[15:39:23] btullis: Hm, bunch of things I've not done before there, feels a bit necky for 15:40 on a Friday
[15:39:47] on the other hand, it might be the best way to avoid us being back here Saturday and Sunday :)
[15:39:59] bblack: yeah I thought re-using the same hostname for different asset tags was always the concern, not necessarily changing the hostname of one asset tag
[15:40:07] yeah... better to be working on a Friday than on a Sunday IMHO
[15:40:10] bblack: we can reimage and repurpose without changing the names
[15:40:15] I suggest re-pooling ms-fe1009 and/or migrating swift-repl to a ganeti instance
[15:40:54] We could take the view that today's problems suggest that the stretch version isn't a biggie, and then repool the two stretch instances again
[15:40:58] I'm assuming the real reason for the ms-fe1009 depool was that swift-repl adds more load there (and also happens to only run on stretch), so that's why it fails first?
[15:41:33] bblack: swift-repl doesn't run very often; when we had the incident the other day there was a slight smell that ms-fe1009 had fallen over first, so maybe was suspect.
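
Since the mix of stretch/buster/bullseye hosts keeps coming up above, a quick hedged way to confirm which distro and swift package version each candidate frontend is actually running, using the cumin host globbing already seen in this log:

    # distro and swift version per frontend (moss-fe hosts included for the re-role)
    sudo cumin 'ms-fe* or moss-fe*' 'cat /etc/debian_version; dpkg-query -W swift'
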
[15:41:43] ah ok
[15:41:57] [contra that, depooling it doesn't take it out of memcached; but the memcached errors we saw today were to all the frontends]
[15:42:00] here's what the initial failure looks like, bblack: https://i.imgur.com/ItSSxqf.png
[15:42:19] it did start off with ms-fe1009 overloading, although it looks like that happened with load that came from the other machines, for unclear reasons
[15:42:31] the explosion doesn't really happen until ~21:35
[15:42:41] but ms-fe1009 is suffering starting an hour before that
[15:43:45] I think we have options a) do nothing b) adjust swift's memcached config to be more relaxed c) repool the two stretch nodes d) repurpose moss-fe* as ms frontends e) d + rename them
[15:44:03] I think given this is the second incident this week a) isn't a good idea
[15:44:17] I like (c) and (b) together tbh
[15:44:34] so peeking back at the Nov 1 incident and ms-fe1009
[15:44:44] +1 to not doing a), with your same rationale
[15:45:00] godog: you seem unsure about b); what approach(es) would you favour?
[15:45:01] first thing in the doc timeline, from alerting, is:
[15:45:02] 21:43 PROBLEM - Swift https backend on ms-fe1011 is CRITICAL
[15:45:15] but just before that on ms-fe1009 in syslog, we have:
[15:45:18] Nov 1 21:36:57 ms-fe1009 kernel: [10210771.100600] python3 invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0
[15:45:30] Emperor: if I can also ask for something, as the weekend is incoming - could I ask for a clear mitigation procedure (e.g. rolling restart) in case deeper changes are not preferred today?
[15:45:44] bblack: that's well after load increased on ms-fe1009 for whatever reason, though
[15:46:02] Nov 1 21:48:24 ms-fe1009 kernel: [10211457.396253] Killed process 179544 (swift-proxy-ser) total-vm:4015008kB, anon-rss:3912908kB, file-rss:0kB, shmem-rss:0kB
[15:46:10] I like c) - haven't looked closely enough at b) yet to be sure. d) seems ok to me too.
[15:46:28] we need to understand what happened at 20:35
[15:46:42] my vote is for (d), and then (e) next week
[15:47:36] Emperor: yeah, with the overall situation and timing I think for sure c), and might as well try b) as a stopgap/bandaid, couldn't hurt
[15:47:46] jynus: our best approach (alas) is a rolling restart of the proxies; if already depooled then that's just systemctl restart swift-proxy (via cumin). If you've not depooled then sudo cumin -b 1 -s 5 O:swift::proxy 'depool && sleep 3 && systemctl restart swift-proxy && sleep 3 && pool'
[15:47:58] Emperor: that's super useful
[15:48:05] I will document it on the incident docs
[15:48:25] bblack: when I looked on the frontends mid-incident they weren't touching swap at all
[15:48:26] c then d, then maybe e next week?
[15:48:27] that way if the change is not enough and it recurs, we have a predefined plan if it pages
[15:48:35] so many options!
[15:49:00] I will document the suggested mitigation on both incident docs
[15:49:07] I'm happy with any of these as long as someone wants to do them today :)
[15:49:09] could even not do all of d. Just re-roling one extra node per DC should be a good deal of headroom, esp with x009 back in service
[15:49:18] +1
[15:49:44] OK, that sounds like I should at least re-image one moss-fe node per DC to bullseye.
[15:49:47] I suggest also maybe sending an email at the end of your day, Emperor, with the decision + pointer to mitigations
[15:50:14] does anyone object to repooling the stretch hosts?
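
For option b) above (make swift more relaxed about memcached), the relevant knobs live in /etc/swift/memcache.conf; the upstream sample config documents connect_timeout, pool_timeout, tries and io_timeout, though whether the error-limiting thresholds themselves are tunable depends on the swift version, so treat the names and numbers below as illustrative rather than a recommendation:

    # show the non-default memcached settings swift-proxy is actually using
    grep -vE '^\s*(#|$)' /etc/swift/memcache.conf
    # "more relaxed" could then mean raising values along these lines
    # (upstream sample defaults in parentheses; illustrative only):
    #   connect_timeout = 0.5   # (0.3)
    #   pool_timeout    = 2.0   # (1.0)
    #   io_timeout      = 3.0   # (2.0)
    #   tries           = 5     # (3)
    # followed by the rolling restart of swift-proxy quoted above.
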
[15:50:32] I don't
[15:50:41] godog: while I'm doing that, could you put a puppet CR together to bring them into service as proxies, please? You seem to be the person most familiar with the operation. This'd be moss-fe1001 and moss-fe2001
[15:50:51] [seems better to parallelise a bit]
[15:51:35] [my part-completed rebase of the rclone patches is so doomed, le sigh]
[15:51:57] I'm also happy to help with this, Emperor, godog. I don't know swift-proxy as well as you, but feel free to ping if I can help.
[15:52:10] we do run different versions of swift-proxy & memcached on stretch vs bullseye, which does seem a bit worrisome
[15:52:32] jhathaway: on the other hand, we had been running in that configuration for longer than this configuration ;)
[15:52:42] very true
[15:55:54] reimage underway
[15:56:14] x2
[15:56:53] I hear no objections to repooling ms-fe{1,2}009, so I'm going to go do that
[16:00:24] Emperor: ok I'll do that
[16:00:27] btullis: thank you!
[16:00:47] what's the tracking task we're using for this again?
[16:02:55] Not sure we have one as yet
[16:03:00] jynus: email> to whom?
[16:03:36] sre-at-large?
[16:04:49] marking moss-fe{1,2}001 as active in netbox
[16:06:29] I'll file a tracking task
[16:06:35] <3
[16:08:04] Emperor: sre-at-large looks like the right audience, yes
[16:08:18] cdanis: let's reuse / rename the existing one
[16:08:44] If too late, I will merge, no worries there: T322417
[16:08:45] T322417: FileBackendError: Iterator page I/O error. - https://phabricator.wikimedia.org/T322417
[16:09:12] ahh
[16:09:14] sounds good
[16:09:38] If we're using that one, can we retitle it?
[16:10:07] maybe "🔥swift-proxy is fine🔥"
[16:10:09] yeah actually I'm going to file a new one and we can merge or make it a subtask
[16:10:16] cdanis: +1
[16:10:22] the wikimedia-production-error template on phab isn't a good fit
[16:13:07] ok I went with T322417 but that's easy enough to change/track down later
[16:13:33] https://gerrit.wikimedia.org/r/c/operations/puppet/+/853325/ and https://gerrit.wikimedia.org/r/c/operations/puppet/+/853324/
[16:15:27] cdanis: let me know if you are editing it, otherwise I will (phab doesn't handle conflicts well)
[16:16:36] https://phabricator.wikimedia.org/T322424
[16:16:39] swift causes not only direct upload errors but also exceptions and errors on text through mediawiki, which is why I was confused last time at first
[16:16:47] thanks cdanis
[16:16:49] ok, merging there
[16:17:08] oh, you did :-)
[16:17:09] I have to go in ~45 but happy to keep helping until that time
[16:17:52] let me link to possibly another one
[16:21:35] E_TOO_MANY_TABS_OPEN
[16:22:57] eheh sorry, I was writing and merging tasks while ignoring IRC
[16:23:24] godog: those look good to me, thanks. I'll +2/merge once the reimages are done
[16:24:30] Emperor: ok! https://gerrit.wikimedia.org/r/c/operations/puppet/+/853324/ should be safe to merge first, new nodes are added depooled to pybal IIRC
[16:24:47] godog: depooled> that's my understanding too
[16:24:48] then https://gerrit.wikimedia.org/r/c/operations/puppet/+/853325/ once we know things are working, to pool memcached
[16:25:59] that's a "takes effect once the other proxies are restarted" thing, isn't it?
[16:26:19] that's correct yeah
[16:26:46] moss-fe1001 reimaged OK
[16:27:07] meant to be orthogonal to the other one, i.e. things should work the same even if moss-fe memcached isn't pooled
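
Once https://gerrit.wikimedia.org/r/c/operations/puppet/+/853324/ is merged and the new nodes show up depooled in pybal, the pooling step discussed above is a conftool operation; the selector syntax is quoted from memory of the conftool docs, so double-check it before running:

    # check a new proxy is registered (and depooled), then pool it
    sudo confctl select 'name=moss-fe1001.eqiad.wmnet' get
    sudo confctl select 'name=moss-fe1001.eqiad.wmnet' set/pooled=yes
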
[16:27:57] I know also that people dislike spreadsheets, but in this case this is super useful to me to keep track of stuff: https://docs.google.com/spreadsheets/d/1EYbMt6xTCDBaWfrPgu8Z1a3CvYrxbH1uH4kVf8MOQfQ/edit#gid=638298398
[16:29:33] I need to go afk for a bit, but I'll be around later
[16:30:09] hopefully dispatch gives us a better version of that
[16:32:48] now running puppet on the two moss fes
[16:33:16] ack
[16:33:38] I've added a temporary mitigation measure on top of both docs, but said to check email in case that changes later on
[16:36:28] Emperor: the other bit to do I think is force-run puppet on ms-fe / ms-be to pick up ferm changes
[16:36:51] godog: my previous notes think ms-fe* for new proxies
[16:37:02] [but backends too, unlikely to hurt :) ]
[16:37:43] yeah backends for sure too, they'll want to talk to frontends :)
[16:44:42] OK, both new moss frontends let me download the testing kitten pictures
[16:45:24] So I'm going to merge the other puppet change now, and then run puppet on the other proxies before a rolling restart and pooling the new servers
[16:47:21] Emperor: ack, SGTM
[16:51:34] I have to go, I'm reachable though if something goes badly sideways
[16:51:45] godog: thanks
[17:08:32] OK, those two moss frontends are fully in service as extra ms proxies
[17:10:32] Emperor: Excellent. Good work!
[17:11:26] I've also emailed sre-at-large in case 🔥 when I'm not around
[17:12:10] Sorry for all the noise on a quiet Friday, folks
[17:15:23] thanks for the email, Emperor!
[20:05:56] gerrit seems quite slow for the last couple of hours. git fetches are going at ~80kB/s and the web interface is loading slowly as well.
[20:05:58] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=gerrit1001&var-datasource=thanos&var-cluster=misc&from=now-7d&to=now
[20:06:10] maybe something/someone is draining the network?
[20:07:24] (although probably correlation not cause, as those are back down and didn't peak over 100Mb)
[20:13:55] e.g. pulling two days' worth of puppet.git:
[20:13:59] Receiving objects: 86% (3791/4407), 1.19 MiB | 42.00 KiB/s
[20:14:03] taking a minute or two
[20:14:25] progressively slowing down with every received byte
[20:19:51] nothing weird on gerrit1001 switch port: https://librenms.wikimedia.org/device/device=161/tab=port/port=14502/
[20:28:07] Krinkle: I noticed issues with the web on Wednesday too
[20:28:23] Absolutely crawling for about 10 minutes
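
One way to follow up on the gerrit slowness at the end is to separate raw network throughput from server-side pack generation: time a plain HTTPS download from the gerrit host and compare it with a small fetch from an existing clone. The URL is just an example endpoint, not a known-good benchmark target:

    # raw HTTPS throughput to gerrit, as reported by curl (bytes/sec)
    curl -so /dev/null -w 'download speed: %{speed_download} bytes/s\n' https://gerrit.wikimedia.org/r/
    # compare with a fetch, which also exercises pack generation on the server
    time git fetch origin   # run from an existing puppet.git clone
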