[01:47:05] mutante: it turns out there is an issue with backups, they are stuck (not fully, as it works when volumes expire), but it is not healthy. I ignored it because I thought it was a temporary delay, but it turns out to be something ongoing (config or storage issue)
[03:49:36] jynus: ok, thank you
[09:17:56] hello, I could use a bit of help on db2205, I'm seeing this processlist: https://phabricator.wikimedia.org/P68764 which appears to trigger a stale state of the server and its child replicas (it's the candidate master of an ongoing -but paused- switchover)
[09:18:30] I've tried to kill orchestrator's query to see if it had an impact, but no
[09:18:46] I'd be tempted to restart the instance (it's depooled)
[09:19:19] `Slave_SQL` thread is stuck on `Commit`
[09:19:37] which makes me think that a restart would be a good fix
[09:19:50] (cc jynus Amir1 if you're around)
[09:21:01] disable semisync
[09:21:29] I have to restart replication for this
[09:21:40] which brings me back to the stuck thread
[09:22:10] stop replication yeah
[09:22:19] but mariadb doesn't want me to :-(
[09:23:12] and I don't want to try and kill the SQL thread's transaction stuck in commit state, as it looks like a bad idea from a distance
[09:23:30] yeah, kill the db, reload the data
[09:23:48] it will likely be corrupted as you force kill it
[09:24:10] ack, I'll restart mariadb instead, that seems better indeed
[09:38:41] please create a ticket
[09:39:27] and if you are doing https://phabricator.wikimedia.org/T374421 please stop
[09:39:43] it is not ok to promote a master just after it crashed
[09:40:41] totally, I was creating it indeed and the switchover is very much halted :D
[09:40:52] rollback as much as you can
[09:41:02] and create a ticket to see what's next
[09:41:42] it is not ok to continue if there is even a minimal chance that data could have been affected, as the host was busy committing
[09:42:13] here the chance is more than minimal I'd say :D I would not have recommended continuing on this one, I've told netops already so they skipped the master, which will have to stay like this for now
[09:42:44] but please create a ticket about the issue, people must be aware
[09:42:47] sure
[09:43:14] most likely there was some kind of issue with semisync
[09:43:37] with it being expected but not properly configured, or something else
[09:44:02] it did not look like it, as usually that does not prevent fixing the semisync issue (meaning I was totally unable to issue a single STOP SLAVE because of the stuck thread)
[09:44:22] but that was my first guess indeed
[09:44:30] was there maintenance ongoing?
[09:44:42] anyway, that is the kind of debugging that can happen on a ticket
[09:44:52] not declared on https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance
[09:50:12] T374425
[09:50:12] T374425: Reimage db2205/db2107 - https://phabricator.wikimedia.org/T374425
[09:54:12] if you can isolate the transaction that was stuck in commit, based on the binlog
[09:54:31] we can see if the writes went through / were retried correctly by doing some comparison
[09:55:13] e.g. comparing the right tables between eqiad and codfw
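A minimal sketch of what that binlog isolation and eqiad/codfw comparison could look like. The binlog file name, position, table, and row range below are hypothetical placeholders rather than values from the incident, and `CHECKSUM TABLE` is only one possible comparison method (inspecting the file with mysqlbinlog, or a dedicated checksumming tool, would serve the same purpose):

```sql
-- Hypothetical sketch: binlog name, position, table and timestamp are placeholders.

-- On db2205: list the binlogs and look at the events around the crash point
-- to identify the transaction that was stuck in Commit.
SHOW BINARY LOGS;
SHOW BINLOG EVENTS IN 'db2205-bin.002345' FROM 4 LIMIT 200;

-- Once the affected table(s) are known, run the same checksum on the eqiad
-- and codfw replicas of the section and diff the results (only meaningful
-- when both sides are caught up or writes are quiesced).
CHECKSUM TABLE enwiki.revision EXTENDED;

-- A narrower spot check if the stuck transaction touched a known row range:
SELECT COUNT(*), MAX(rev_id)
FROM enwiki.revision
WHERE rev_timestamp >= '20240911090000';
```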
[09:55:38] but sadly, if a host crashes but doesn't have gtid enabled, it is likely to end up corrupted on crash
[09:56:05] which happens to be the case only for the secondary master
[09:56:29] (the primary one can crash, but it will always be consistent with itself :-P)
[09:57:40] if dc ops can guarantee <10s of lag, it wouldn't be terrible to do it live for the secondary, if needed
[09:59:04] yeah but they cannot guarantee it, as they depend on cable quality that can be erratic according to Murphy's law
[09:59:24] then yes, better to slow down and make sure things are healthy
[09:59:47] Manel told us he knew why the problem happened
[09:59:54] but he didn't tell me
[10:00:30] can you double check the plugins configured on both hosts?
[10:00:39] I will ask that on the ticket
[10:00:47] sure, let's coordinate there!
[10:01:03] hosts are puppetized, it'd be weird if they had different plugin config
[10:01:14] well, but as one is promoted
[10:01:26] it changes roles, which requires a reboot
[10:01:32] maybe that was missed
[10:01:35] or something
[10:01:44] idk
[10:02:06] me neither atm :D
[10:02:25] I'll go fix myself lunch, bbiab
[12:56:48] FIRING: [2x] MysqlReplicationLagPtHeartbeat: MySQL instance db2205:9104 has too large replication lag (11m 19s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db2205&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLagPtHeartbeat
[12:57:01] downtiming, sorry for the noise
[14:10:24] db2127 is back on its proper topology level, replag is catching up, waiting for feedback on db2205 to reimage it
[15:38:39] urandom: hey, just a reminder we're doing network maintenance in codfw starting at the top of the hour
[15:38:51] Matthew asked you to check a few nodes afterwards
[15:38:52] https://phabricator.wikimedia.org/T373097#10126233
[15:38:57] I'll let you know when we're done
[15:44:03] topranks: 👍
[16:17:24] urandom: that's us done, hosts are all responding to ping again anyway
[16:17:58] topranks: thanks, I'll look over the swift backends
[16:24:02] (and for posterity's sake, everything looks good 🙂)
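For the "double check the plugins configured on both hosts" request at 10:00:30, and the semisync-disable attempt at 09:21, a sketch of queries that could be run on both db2205 and its master. Depending on the MariaDB version, semisync is a loadable plugin or built into the server, so the plugin query may return nothing even where the feature exists; treat this as an assumed checklist, not the team's actual procedure:

```sql
-- Is semisync present as a plugin on this host?
SELECT plugin_name, plugin_status, plugin_library
FROM information_schema.plugins
WHERE plugin_name LIKE '%semi%';

-- Configuration: is semisync expected to be enabled, and in which role?
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_%';

-- Runtime state: is it actually active, and does the master see semisync clients?
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_%';

-- What was attempted during the incident (09:21): turning slave-side semisync
-- off before stopping replication; here it could not help because the SQL
-- thread was already stuck in Commit.
SET GLOBAL rpl_semi_sync_slave_enabled = OFF;
STOP SLAVE;
```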
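And since the 09:55:38 point was that a replica without GTID is likely to come back corrupted after a crash, a quick way to confirm whether a given replica is GTID-based (MariaDB syntax, default connection name assumed):

```sql
-- Using_Gtid in the output is 'No', 'Slave_Pos' or 'Current_Pos'
-- (\G is the command-line client's vertical output terminator).
SHOW SLAVE STATUS\G

-- The GTID positions the server currently knows about:
SELECT @@gtid_slave_pos, @@gtid_binlog_pos, @@gtid_current_pos;

-- Switching a replica to GTID-based replication would be along these lines,
-- but only on a healthy host, never on one that has just crashed:
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
START SLAVE;
```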