[07:51:55] Amir1: I will take care of db1122's reimage
[08:08:55] Emperor: re: chunked objects, I'm really not sure whether there are or not offhand :
[08:08:58] :|
[09:07:46] marostegui: do you currently have a schema change running against s2? the dbmaint map shows 3, but i think maybe those are done?
[09:07:56] yes, they are done
[09:09:03] ok great
[09:15:13] rebooting ms-fe1012 before repooling it.
[09:16:50] marostegui: i'm going to reboot db1115. i think i need to downtime: db2093, orchestrator alerts, the systemd alerts for all prometheus hosts. am i missing anything?
[09:19:06] i think that's it yeah
[09:19:17] 🤞
[09:26:08] fsck it, even more problems in wmf/rewrite.py /o\
[09:27:28] I'm coming to think having our own home-grown middleware was a poor life choice :(
[09:27:40] heresy
[09:31:45] marostegui: looks like it worked \o/
[09:33:20] great!
[09:33:54] marostegui: so.... parsercache reboots. they don't need a scheduled maintenance window, right? we 'just' do the swap to the spare, reboot the primary, swap back
[09:34:04] yeah, that is
[10:29:02] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 19.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[10:30:00] I haven't checked, but normally m1 lag is caused by etherpad
[10:30:16] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[10:38:47] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 57 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[10:38:47] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 56.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[10:41:35] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[10:43:05] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[10:43:55] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/786290 to fix the 404 handler? I see at least one more different class of error in the logs (on an experimental repool), but it's in a different bit of rewrite.py, so I think get this fixed and I'll look at the next thing
[10:47:36] Emperor: ack, yeah change LGTM
[12:07:22] Amir1: I have finished the root grants ticket :)
[12:31:23] marostegui: how does this look? T306892
[12:31:23] T306892: Reboot pc1011 - https://phabricator.wikimedia.org/T306892
[12:32:28] Ah, I thought it was failed or something
[12:32:34] When I saw just one ticket for it XD
[12:32:46] :)
[12:32:56] It should be fine I think
[12:33:13] there's no point in messing around with replication for pc2011 i reckon
[12:33:25] You also need to downtime pc1014 though
[12:33:41] oh, yeah, good point.
[12:34:08] updated
[12:34:38] kormat: the apt full-upgrade is done already?
[12:34:55] i... hum. have not been doing that.
[12:35:13] I normally do that before any reboot, just in case we have many packages pending to upgrade
[12:35:19] it doesn't hurt
[12:36:21] i'll update my script to add that.
[12:36:27] sounds good
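A minimal sketch of the "full-upgrade before reboot" step being added to the reboot script above; the function name, ssh wrapper and sudo invocation are illustrative assumptions, not the actual script:

```python
import subprocess


def upgrade_then_reboot(host):
    """Apply any pending package upgrades on `host`, then reboot it."""
    def remote(command, check=True):
        # Run a single command on the target host over ssh.
        return subprocess.run(['ssh', host, command], check=check)

    remote('sudo apt update')
    # Non-interactive so a stray config-file prompt can't stall the window.
    remote('sudo env DEBIAN_FRONTEND=noninteractive apt full-upgrade -y')
    # The ssh session may be torn down as the host goes away, so don't
    # treat a non-zero exit here as a failure.
    remote('sudo reboot', check=False)
```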
[12:39:01] once i've done pc1, i'll move pc2014 to pc2, and wait a few days
[12:39:06] *pc1014
[12:39:22] cool
[12:39:35] once everything is done, you probably want to fully truncate pc1014
[12:39:40] so it starts clean
[12:40:29] hmm. pc1014 has mariadb 10.4.24+deb11u2 installed, but apt wants to _downgrade_ it to 10.4.22+deb11u2
[12:40:31] is that legit?
[12:40:39] yes
[12:40:58] see my email from 25th March
[12:41:25] gotcha
[12:41:40] I am actually going to ping mariadb to see if they have any update
[12:42:50] godog: see https://phabricator.wikimedia.org/P26548 (NDA) for a pattern I'm seeing in proxy-server log ; the user-agent is unusual and stretch swift returns 412 Precondition Failed. bullseye explodes on what I think is mangled UTF-8 at the end of the string
[12:53:09] Emperor: fascinating! but yeah looks like the client is sending garbage (?)
[12:54:25] yes, it's malformed UTF-8, so maybe having an exception raised internally and saying 400 to the user isn't actually bad? other than the disk space taken up by the exception-barf...
[12:55:21] it's just there are quite a few (ms-fe1011 has said 412 1308 times today)
[12:56:01] Maybe the answer is to stop trying to log the garbage-UTF8?
[12:56:55] we currently do self.logger.exception('Failed to decode request %r', env)
[12:58:00] godog: do you have a feeling about that? the exception logging is quite verbose, but maybe that's just the way things have to be
[12:58:56] Emperor: agreed on 400-ing back to the client, I don't feel strongly re: logging the whole request on exception at least by default
[12:59:38] marostegui: well, this CR is not my finest work 😅
[12:59:38] godog: [I guess the alternative would be to add a new bit of code that checks the incoming path_info turns into valid utf-8 and bails earlier, which would let us differentiate "client sent garbage" from "we tried to rewrite and failed"]
[12:59:40] to make sure I understand right, once we're replying 400 to the client on decode error we'd be still hitting that code path ? i.e. no exception
[12:59:45] kormat: :)
[13:00:31] Emperor: yeah IMHO if the client is sending garbage we should 400 and not continue
[13:00:41] godog: after we log the exception, we send swob.HTTPBadRequest('Failed to decode request')
[13:02:00] godog: https://github.com/wikimedia/puppet/blob/146383fea51faba2d2553af9567bda36df049fc8/modules/swift/files/python3.9/SwiftMedia/wmf/rewrite.py#L208 and following
[13:03:34] Emperor: ack, ok so bullseye explodes while attempting to log ?
[13:04:35] marostegui: i'll wait until after the backport window before pushing this
[13:04:48] godog: not quite - line 211 is the UnicodeDecodeError handler, which logs the backtrace (verbose!) to server.log (line 212), then sends bad request to the client
[13:05:43] kormat: cool
[13:06:00] Emperor: I see, yeah +1 to logging less verbose in that case then
[13:06:33] the log ends up a bit confusing because the backtrace is from the code on line 233 (doing the // removal), which is mojibake-safe iff input is valid utf-8
[13:07:31] godog: hence I think we should sanity-check that the path_info is valid before we start doing anything with it (and probably just say 400 back to the user), and leave the exception handler to pick up explosions from actual re-writing, if you see what I mean?
[13:08:03] Emperor: I do, makes sense to me
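A minimal sketch of the up-front path_info sanity check proposed above; this is not the patch that followed in Gerrit, and the helper name and placement are assumptions. It relies only on swob.HTTPBadRequest and self.logger, which the conversation confirms the middleware already uses:

```python
from swift.common import swob


def valid_utf8_path(env):
    """Return True if the request's PATH_INFO decodes cleanly as UTF-8."""
    try:
        # Under PEP 3333 the path arrives as a latin-1-decoded "WSGI string",
        # so round-trip it to check the underlying bytes are really UTF-8.
        env['PATH_INFO'].encode('latin-1').decode('utf-8')
        return True
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False


# At the top of the middleware's __call__, before any rewriting:
#
#     if not valid_utf8_path(env):
#         # One log line instead of a full backtrace in server.log.
#         self.logger.warning('Invalid UTF-8 in path: %r', env.get('PATH_INFO'))
#         return swob.HTTPBadRequest('Failed to decode request')(env, start_response)
```

This keeps "client sent garbage" (early 400, short log line) separate from "we tried to rewrite and failed" (which still hits the existing exception handler).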
[13:43:08] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/786311 ?
[13:48:02] Emperor: neat! LGTM
[13:57:51] repooling ms-fe1012, watching for more 🔥
[14:00:03] * kormat stands back
[14:01:27] server.log errors now just look like those on stretch (if a smidge more verbose); client disconnects and some timeouts talking to backends
[14:01:38] marostegui: thank you!!!
[14:19:46] godog: ms-fe1012 seems mostly good, but. There is a small trickle of 500 errors (cf https://grafana.wikimedia.org/d/000000584/swift-4gs?orgId=1&var-DC=eqiad&var-prometheus=thanos&from=now-30m&to=now-1m )
[14:30:40] (but when I try and reconstruct a curl rune to retry, I end up with 429 instead on both stretch and bullseye)
[14:34:09] godog: e.g. https://phabricator.wikimedia.org/P26556 - AFAICT my curl is doing the same request but getting 429 instead of the 500 originally reported.
[14:34:42] [it's a tiny fraction of requests, but still]
[14:35:41] probably unrelated, but does it also happen when you customize the user agent (-A)?
[14:36:48] yeah, still 429
[14:37:12] oh, maybe I need to un-url-encode the user-agent
[14:37:20] it was worth trying to search for an unknown filter :-/
[14:38:01] (even if I know those are generally only at cache level)
[14:38:26] even if I un-url-encode the user-agent, still 429
[14:41:31] godog: also, no obvious server.log entry to match the 5xx
[14:52:15] Emperor: mmhh ok, yeah no 500 in server.log means it could be from thumbor
[14:52:37] Emperor: which makes sense for the subsequent 429 because thumbnails that repeatedly fail get rate limited
[15:01:29] godog: ah, OK, so probably not something to be worrying about here, then, in terms of the bullseye upgrade?
[15:02:57] Emperor: I'd say so too yeah, especially the "no 500 in server.log" clue
[15:07:04] 👍
[15:34:26] > didn't show up in the original report given the fact that it picks up a wiki at random (in order to avoid s3 ruining the report).
[15:34:30] Amir1: can you explain^?
[15:34:58] kormat: the script that builds the report for https://drift-tracker.toolforge.org/
[15:35:10] it picks up a wiki from each section at random
[15:35:19] but.. why?
[15:35:25] it doesn't matter for s8 or s1
[15:35:56] s3 has 1k wikis, so not only would it take forever to build the report, any sort of drift in s3 basically adds 1k drifts to the report
[15:36:12] would make the rest of the sections useless
[15:36:49] so instead it may be incomplete, and thus also useless in a different way?
[15:37:27] if you know it's incomplete and take that into account, it wouldn't be useless
[15:38:11] originally most of the drifts were per section not per wiki
[15:38:19] e.g. missed schema changes
[15:39:02] i certainly didn't know it was incomplete :) that seems like something that should be pretty prominently displayed
[15:40:15] yeah, I have been thinking of a way to improve it. One way is to pick five wikis at random
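A purely illustrative sketch of the per-section sampling described above; the real drift-tracker report builder is not shown here, and the function and data layout are made up:

```python
import random

SAMPLE_SIZE = 5  # e.g. five wikis per section, as floated above


def pick_wikis(sections):
    """Pick a small random sample of wikis from each section.

    Sampling keeps s3 (with its ~1k wikis) from dominating the report,
    at the cost of the result being knowingly incomplete.
    """
    return {
        name: random.sample(wikis, min(SAMPLE_SIZE, len(wikis)))
        for name, wikis in sections.items()
    }


# Example: one wiki sampled from s1, up to five from s3.
print(pick_wikis({'s1': ['enwiki'], 's3': ['aawiki', 'abwiki', 'acewiki']}))
```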
[16:26:29] Hi all, going to do the reimage of clouddb1021 again today to finish off https://phabricator.wikimedia.org/T299480