[09:29:21] marostegui: are you currently running a schema change against the s7/codfw primary?
[09:29:37] it is running against s7
[09:29:53] and yes, it seems it is now running on db2121
[09:29:56] ok, i'll take that as a 'yes' for my purposes
[09:30:06] it shouldn't take long, I can ping you when done
[09:30:11] perfect, thanks :)
[09:30:49] marostegui: also, i was trying to figure out what's needed to reboot an x2 primary. as it's (still?) not in production use, is it ok to just downtime the entire section and do the reboot?
[09:31:17] kormat: yes, that should be fine. I would double check with Timo to be fully sure it is not used for anything
[09:31:57] ack, ping'd him.
[09:32:21] otherwise it's a primary switchover, with a db-move-replica first, i'm guessing
[09:32:41] (to move the other dc's primary to the primary candidate)
[09:33:37] yes, but I am sure it is still not used
[09:33:43] But let's wait for him to fully confirm
[09:33:56] 👍
[10:33:12] nice- after reimage, the media backups user has changed UID, so I have to change the properties of 100 million files
[10:33:38] ^ moritzm this doesn't scale for stateful services
[10:37:04] kormat: you can go with s7 codfw primary
[10:37:10] great, thanks.
[10:40:25] marostegui: can you exit your shell that's in /srv on db2121 pls?
[10:41:19] ah yes sorry
[10:41:28] done
[10:41:55] thanks :)
[10:54:19] jynus: I did an installer bodge for swift nodes to avoid this issue, might be worth pinching as an idea?
[10:55:06] jynus: puppet commit 26cc5af93da3974d526a134d57962ed7f116af02
[10:55:26] jynus: FYI mori.tz is OOO
[10:55:33] I mean, UIDs can be changed- but this one was taken over by debmonitor user, and I am unsure to touch that
[10:55:56] you'll probably want to pick a fleet-wide standard (like we are migrating to with swift painfully slowly)
[10:56:17] Emperor: I would be interested in a solution not just for me
[10:57:07] broadly, I get the impression that stateful services where we try and preserve filesystem contents across a reimage are the small minority
[10:57:20] yeah, that is why we should support each other :-D
[13:30:46] godog: I'm doing proxy upgrades first, but when it comes to backend reimages, the sre.hosts.reimage cookbook has a warning "All data will be lost unless a specific partman recipe to retain partition data is used."; AIUI the swift backend nodes do have such a recipe to preserve their spinning disks. Where would I check were I feeling paranoid?
[13:31:35] Emperor: in puppet, in the partman recipe, give me a sec
[13:31:53] in modules/install_server/files/autoinstall/netboot.cfg
[13:31:59] TY
[13:32:58] ah, yes 'ms-be[1-2]*) echo partman/custom/ms-be.cfg ;;'
[13:33:12] yes, I was checking that recipe
[13:33:14] Emperor: what volans said indeed, you won't see any spinning disks in the partman recipe which is why partman will ignore them
[13:34:14] so yeah preserving by way of ignoring I suppose
[13:34:25] is done in a different way than others
[13:34:55] like the recipes that have 'reuse' in the name and use 'keep'
[13:36:06] * Emperor is definitely not volunteering to go anywhere near touching partman recipes ;p
[13:36:23] good call
[13:36:33] but yeah reuse is by the graceful kormat
[14:01:25] Amir1: https://phabricator.wikimedia.org/T304626#7845035
[14:02:16] marostegui: thank you so much
[14:02:24] I'm still baffled what caused it
[14:23:19] godog: I'm mid-reimage of ms-fe1012 and AFAICT it is already being fed queries even though it's pooled: inactive in confctl...?!?
[14:24:21] 'confctl select cluster=swift,dc=eqiad get' e.g. has it inactive for both nginx and swift-fe
[14:24:48] have you checked ipvsadm too to make sure of that? is there anything else that points to it directly outside LVS?
[14:25:07] Oh, no, I am just stupid, I think
[14:25:34] could be healthchecks (?)
[14:25:50] it's alright, I was being stupid, ignore me and I might go away :)
[14:26:40] hah no worries, please keep asking questions tho
[14:27:50] waiting for puppet to do its thing, I think
[14:30:13] Ah, found a bootstrap issue
[14:39:00] hah, I bet puppet's cranky ?
[14:51:38] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/779050 fixes it
[14:51:41] +1?
[14:53:16] Emperor: close
[14:56:24] godog: oh, bother.
[14:57:11] godog: in more problematic news, I had to unpool ms-fe1012 because of proxy-server errors
[14:58:31] Emperor: hah, I'll take a quick look at the errors
[14:58:33] Looks to be at least two different classes of problems
[14:59:11] wmf/rewrite.py is failing on 'HTTPMessage' object has no attribute 'gettype'
[15:00:07] also a bunch of value.encode('latin1').decode('utf-8')#012UnicodeDecodeError
[15:00:49] indeed
[15:03:20] (I'm going to go and re-fix the CR first, because I feel like I have a better chance of success there...)
[15:04:24] Emperor: SGTM, re: the errors in the past I've used rewrite_integration_test.py in the 'swift' puppet module to test/fix similar errors
[15:05:47] is this ms-specific and that's why thanos didn't pick it up? I wasn't expecting show-stoppers at this point :-/
[15:07:06] yeah rewrite.py is the middleware that takes care of thumbnails essentially
[15:07:24] i.e. ms-specific
[15:51:04] godog: I don't see how rewrite_integration_test.py is helping? If I run it on ms-fe1012 it doesn't seem to actually talk to the local swift FE, and I get a couple of 403s
[15:51:46] or at least, I don't see the earlier errors again
[15:54:29] Hm, it is talking to localhost, but the tests fail (two 403s rather than 302s) and it doesn't reproduce the errors we saw when the server was pooled :(
[15:55:53] Emperor: yeah the 403s I believe because the purging policies changed, but aside from that I used the file to reproduce errors seen in the past, the errors you are seeing I'm assuming will need new tests
[15:56:18] (jumping in a meeting in 5)
[15:57:05] not compulsory of course to use rewrite_integration_test.py to reproduce the errors for sure
[15:57:45] i.e. the 403s can be ignored (I think, assuming the same failures happen on existing hosts)
[15:57:54] s/existing/stretch/
[15:58:12] Not quite sure _how_ to work out what caused the errors - there are 276 proxy-access.log entries matching the timestamp of one server.log error :(
[16:03:55] like, I can find RAW_PATH_INFO in server.log, but I don't see how that gets me closer to finding out what the problem was and/or how one might make a test that checks a fix was effective
[16:05:48] Ah
[16:06:00] RAW_PATH_INFO = /wikipedia/commons/thumb/2/23/Drapeau_fr_d%C3%A9partement_Morbihan.svg/188px-Drapeau_fr_d%C3%A9partement_Morbihan.svg.png
[16:09:02] godog: I know there's a meeting now (and then it's COB at least for me), but: https://phabricator.wikimedia.org/P24433 seems to suggest that we're making a bad assumption about encoding? That matches the error message I was seeing
[16:09:05] #012 File "/usr/local/lib/python3.9/dist-packages/wmf/rewrite.py", line 378, in _handle_request#012 req.path_info = "/v1/%s/%s/%s" % (self.account, container, urllib.parse.unquote(obj))#012 File "/usr/lib/python3/dist-packages/swift/common/swob.py", line 811, in setter#012 value.encode('latin1').decode('utf-8')#012UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 62: invalid continuation byte
[16:10:43] but obviously I've never looked at rewrite.py
[16:11:07] Emperor: ah yeah that could quite well be it, though of course left to be validated if it is specific to bullseye / python3 or the same errors show up in stretch hosts too
[16:11:19] also COB for me, will be off the rest of the week
[16:13:47] godog: anyone else who might be able to help with this? obviously this has blocked starting the ms bullseye upgrade, which was already going to be tight
[16:15:24] https://upload.wikimedia.org//wikipedia/commons/thumb/2/23/Drapeau_fr_d%C3%A9partement_Morbihan.svg/188px-Drapeau_fr_d%C3%A9partement_Morbihan.svg.png (made by the obvious dumb C&P) works
[16:17:21] and the stretch hosts will be running a py2 version of the code
[16:17:25] I think...
[16:20:28] the equivalent of doing value.encode().decode('utf-8') works, but I'm so unfamiliar with this code, I don't feel safe doing that to prod...
[16:21:18] in particular, that assumes that the unquoted URL is UTF-8 rather than Latin-1 that the current code is assuming. I'm not sure where that previous assumption came from
[16:26:30] Hm, the error is maybe coming from File "/usr/lib/python3/dist-packages/swift/common/swob.py", line 811, in setter#012 value.encode('latin1').decode('utf-8')#012UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 62: invalid continuation byte
[16:26:59] which is the next entry in the traceback. but I might be misreading the log
[16:28:26] So I _think_ that swift/common/swob.py is assuming latin1 but it's actually utf-8? But how would that ever have worked?
[16:29:06] * volans close both eyes to prevent bleeding
[16:39:00] if six.PY2:
[16:39:00]     return wsgi_str
[16:39:00] return wsgi_str.encode('latin1')
[16:39:34] marostegui: 70GB removed from s7 (flaggedtemplates stuff) https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db1158&var-datasource=thanos&var-cluster=wmcs&from=now-1d&to=now
[16:40:03] godog: So it looks like in py2, swift's swob.py just passes strings around inside wsgi, but in py3 it wants them to be latin1... but what happens if your URL contains characters not expressible in latin1? Which seems like something we might have
[16:43:15] e.g. https://upload.wikimedia.org/wikipedia/commons/thumb/2/2d/%E5%8F%B0%E7%81%A3%E7%BE%A4%E5%B1%B1%E6%97%97.svg/120px-%E5%8F%B0%E7%81%A3%E7%BE%A4%E5%B1%B1%E6%97%97.svg.png
[16:44:19] so I'm not really sure how this is meant to work
[16:48:24] Could someone check for me whether the cuc_actor column exists in the cu_changes table of guwwiki?
[16:48:31] same for shnwikivoyage?
[16:48:50] (both are s5 wikis)
[16:54:12] zabe: seems to exist on both at least on a random replica
[16:55:16] thanks, that is good enough for me
[17:02:09] Emperor: I'm not sure either tbh, haven't looked too deeply into swob in a few years now
[17:03:44] :-(
[17:04:59] Emperor: latin1 encoding is 8-bit clean, so any characters can fit in a latin1 string as long as you are using it as an opaque token. The only place you get into trouble when treating utf-8/utf-16 encoded strings as latin1 is when you want to treat the contents as code points rather than binary data.
[17:10:05] Emperor: if talking about files uploaded, I would say 10-20% are not latin1-compatible
[17:10:35] they are all utf-8-compatible, though (no invalid utf-8 file names)
[17:14:01] ok gotta go, talk to you post-easter !
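
Note on the encoding question above (16:09 to 17:04): the UnicodeDecodeError from the traceback can probably be reproduced in isolation with nothing but the standard library. The sketch below is an illustration under stated assumptions, not swift's or rewrite.py's actual code. `swob_style_check` only mimics the single `value.encode('latin1').decode('utf-8')` line quoted in the traceback, and `to_wsgi_str` is a hypothetical helper name. It shows the "8-bit clean" point: if the percent-escapes are unquoted byte-for-byte into latin1 code points, the check passes even for names that are not latin1-compatible. Whether that is the right change for rewrite.py is not settled in this log.

```python
from urllib.parse import unquote

# Object names copied from the URLs quoted in the log.
quoted = "Drapeau_fr_d%C3%A9partement_Morbihan.svg"
cjk = "%E5%8F%B0%E7%81%A3%E7%BE%A4%E5%B1%B1%E6%97%97.svg"


def swob_style_check(value: str) -> str:
    """Mimic the one line shown in the traceback (assumption: nothing more)."""
    return value.encode("latin1").decode("utf-8")


def to_wsgi_str(percent_encoded: str) -> str:
    """Hypothetical helper: unquote so that each raw byte becomes one code point."""
    return unquote(percent_encoded, encoding="latin-1")


# Default unquote() decodes the escaped bytes as UTF-8, producing U+00E9.
# Re-encoding that as latin1 gives a lone 0xe9 byte, which is not valid UTF-8.
native = unquote(quoted)
try:
    swob_style_check(native)
    failed = False
except UnicodeDecodeError:
    failed = True

# Unquoting as latin1 keeps the original UTF-8 bytes intact, so the check passes,
# including for a name with no latin1 representation at all.
ok_latin = swob_style_check(to_wsgi_str(quoted))
ok_cjk = swob_style_check(to_wsgi_str(cjk))

print(failed, ok_latin == native, ok_cjk == unquote(cjk))
```

The design point is that the latin1 string is used as an opaque carrier for bytes, so it only has to round-trip, not be readable.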