[01:05:44] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:07:58] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[09:11:32] (MysqlReplicationLag) firing: MySQL instance db1134:9104 has too large replication lag (2h 9m 2s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1134&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:11:43] ^ me
[09:26:32] (MysqlReplicationLag) resolved: MySQL instance db1134:9104 has too large replication lag (11m 58s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1134&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:37:40] db1176 is now running MariaDB 11 :)
[09:52:07] Amir1: https://phabricator.wikimedia.org/T326211#8500968
[10:21:46] oh nice
[13:16:04] https://phabricator.wikimedia.org/T326309 is thumbnails on arywiki broken "since forever", suggesting (they think) a LocalSettings.php config error regarding swift setup. But I thought our swift credentials were in PrivateSettings.php (and the same everywhere)...?
[13:18:00] Emperor: correct, we use the same swift credentials for all the wikis
[13:19:23] the linked https://www.mediawiki.org/wiki/Topic:Xa48l2j22xg953la suggests the problem is (only?) with images uploaded directly to arywiki
[13:20:32] [but I don't know how LocalSettings.php is set up on the various wikis, nor where one might go looking to see what is wrong on ary]
[13:32:57] Right, I've been digging and found /srv/mediawiki-staging/wmf-config/
[13:33:22] InitialiseSettings.php doesn't have any arywiki config that might affect swift
[13:35:13] likewise CommonSettings.php
[13:36:54] are the swift ACLs correct?
[13:41:34] marostegui: objections to running a maint script that increases the size of s3 by maybe 50GB-100GB? Context T312666 (filling new fields before we can start removing old ones)
[13:41:34] T312666: Remove duplication in externallinks table - https://phabricator.wikimedia.org/T312666
[13:41:48] Amir1: we should have plenty of space in s3
[13:41:50] let me check
[13:41:56] Emperor: hmm. arywiki doesn't have an EDP, so I'm not sure why it even has local uploads enabled
[13:42:10] Amir1: yeah, 100GB shouldn't be a big deal
[13:42:18] awesome
[13:44:38] taavi: sorry, not sure what you mean by EDP (but maybe the problem is that it shouldn't have local uploads enabled at all?)
[13:46:09] https://meta.wikimedia.org/wiki/Non-free_content#Exemption_Doctrine_Policy
[13:49:58] Emperor: basically small wikis are not allowed to have local uploads anymore
[13:52:43] Amir1: seems sensible; but is it possible/easy to find out if ary _should_ have local uploads? [I'm hazarding a guess that the problem here might be that arywiki config and/or admins think it should, but something between it and swift (or maybe swift) thinks otherwise?]
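On the question at 13:36 of whether the Swift ACLs are correct: a minimal sketch of how the per-wiki containers could be inspected with the standard python-swiftclient CLI, assuming the usual wikipedia-<lang>-local-* container naming. The container names, credentials, and the ACL value shown are illustrative assumptions, not details taken from this log.

    # Swift credentials (ST_AUTH/ST_USER/ST_KEY or OS_* variables) must already be
    # set in the environment; container names below are assumed, not confirmed.
    swift stat wikipedia-ary-local-public | grep -i acl
    swift stat wikipedia-ary-local-thumb | grep -i acl
    # If a container turned out to be missing its read ACL, something like this
    # could grant anonymous read access (the '.r:*' value is purely an example):
    # swift post --read-acl '.r:*' wikipedia-ary-local-thumb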
[13:53:36] yeah, I need to check that
[13:55:27] would appreciate it, I don't really feel like I know where to go chasing here
[14:21:51] I think I created that wiki and I remember it was very uncooperative during its creation, so I might have messed up something
[14:26:38] Amir1: I think WikimediaMaintenance has a script to fix the swift access rules, do you think that could help?
[14:35:25] (the right answer sounds like it might be to block local uploads rather than granting swift access, though?)
[14:37:36] let me dig up the ticket for its creation
[14:37:45] maybe that says if it should have been allowed or not
[14:38:44] > Local file uploads: yes
[14:38:45] ugh
[14:41:02] sad times
[14:47:59] Emperor: fixed ^_^
[14:48:28] Amir1: thank you <3
[14:49:50] Is it OK to use my staff account to comment on https://www.mediawiki.org/wiki/Topic:Xa48l2j22xg953la to say it's fixed?
[14:50:13] sure thing
[14:52:25] * Emperor has mostly internalised "don't edit anything with your staff account" which is obviously Not Right, but probably a sensible default ;-)
[14:52:47] (and this is obviously a "wearing a jaunty WMF staff hat" moment)
[15:15:58] mostly it's "don't edit Wikipedia et al content pages"
[19:04:51] o/ I have a question for the mariadb experts in the room :) I have a toolsdb replica (clouddb1002) stuck on a single transaction for days, I want to skip that transaction but I'm not sure of the Best (tm) way. more details at T326261
[19:04:52] T326261: [toolsdb] clouddb1002 stopped replicating from clouddb1001 (again) - https://phabricator.wikimedia.org/T326261
[19:05:33] I discussed this earlier with maros.tegui but I thought I'd write here instead of pinging him directly as it's quite late in the EU
[19:06:35] yeah, I'll check once I figure out how to do this one thing
[19:06:43] never did this before, I see there's sql_slave_skip_counter, which would probably fail because I have multiple domain_ids in gtid_slave_pos. would adding 1 to gtid_slave_pos work?
[19:06:45] s_t is discouraged
[19:06:54] thanks Amir1
[19:09:53] it's a bit of an unusual case, the replica is already inconsistent and I just want it to resume replication for a little bit longer, until we replace it with a proper one
[19:15:48] dhinus: my suggestion is to stop replication, check what's making the transaction quite slow, make it fast (e.g. truncate the table) and start replication back again
[19:16:00] if that falls under acceptable loss
[19:18:45] the transaction is (as far as I understand) deleting millions of lines from a table with no primary key
[19:19:20] so I can't really think of a way to make it fast... and I'm ok with those lines remaining there, as a temporary measure
[19:19:25] maybe you have to just wait for it to finish :/ once done, it'll be fast
[19:19:39] it's been stuck for 10 days :D
[19:19:52] okay 10 is a lot
[19:20:12] how big is the table file?
[19:20:19] lemme check
[19:21:21] 664M
[19:23:11] that's not too bad. copy the file to somewhere else, stop replication, truncate the table, start replication, wait for that transaction to blow over, stop replication, copy the ibd file back
[19:23:27] didn't test it though :D
[19:23:57] heh, I can try, but what happens after the file is back?
[19:24:36] won't replication crash again?
[19:25:00] I'm not following
[19:25:14] if the table is empty, it won't delete anything
[19:26:15] but after I copy the original ibd file back, will it still be empty?
[19:27:01] I don't know enough about the mariadb internals really, I can give it a go.
[19:27:35] ahhh you mean copying the file from the primary
[19:27:40] it won't be empty but there won't be any replication delete action to do
[19:27:43] right?
[19:31:47] so there are multiple steps. on the first step, after I truncate the table, the problematic transaction might not even crash because it's doing DELETE WHERE, and it will just delete 0 lines, I think?
[19:33:02] yup :D
[19:33:20] it's horrible but I don't have a better idea
[19:33:30] these tables should move to somewhere decent
[19:33:34] then I stop replication again, and replace the ibd file in the replica with a copy from the primary
[19:33:34] and refactored
[19:33:55] don't copy from primary
[19:34:38] why not? the file from the replica will have all the extra million lines, no?
[19:34:56] you can do either of these: 1- copy from the backup file you made, but you have to stop replication right after you started it (and the problematic transaction is done)
[19:34:58] or
[19:35:40] you have to wait days for the replication to fully catch up (I assume there will be more changes after that transaction), then stop replication and then copy the file
[19:35:58] yes I hope it won't be days but it will def be hours
[19:36:20] with the flow of writes I've seen, I think it'll take a couple of days :D
[19:36:32] yeah likely :)
[19:36:48] in wmf prod it's 1:3 or 1:5 if the section is quiet, but in toolsdb it's 1:1.5 even
[19:36:50] so you would back up the ibd on the replica anyway
[19:37:46] oh definitely
[19:38:15] my worry is that right now the replica file contains 4 million rows, and ideally I'd like to end up with only 1 million (as in the primary)
[19:38:42] if I copy back the current replica file, won't I get back to 4 million rows in the replica?
[19:40:00] yup, you would, so the second way of doing it seems better
[19:40:21] I didn't know how massive the deletion was
[19:41:52] second way = wait for replication to fully catch up?
[19:43:09] yup
[20:58:05] I'm not really here, but copying and moving ibd files just like that won't work and will corrupt the innodb tablespace entirely
[20:58:25] I can talk to you on Monday dhinus about how to skip that transaction
[20:58:29] as I'm off tomorrow
[20:58:42] if you want to back up the table, do a mysqldump
[20:58:51] but don't copy the .ibd file
[20:59:57] thanks marostegui, we continued the conversation with Amir1 in private. we found another way that fixed the immediate problem...
[21:00:13] (we just added the table to the exclusion list Replicate_Wild_Ignore_Table)
[21:00:43] I'm writing a report in phab of what I did :)
[21:01:03] I'm also off tomorrow, but I will keep an eye on IRC
[21:03:40] thanks Manuel, sorry about that part!
[21:06:42] https://phabricator.wikimedia.org/T326261#8503035
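On dhinus's question at 19:06 about skipping the stuck transaction: with MariaDB GTID replication, the commonly described approach is to advance gtid_slave_pos past the offending event rather than rely on sql_slave_skip_counter. A minimal sketch under those assumptions follows; the GTID values are placeholders, only the sequence number of the affected domain is bumped, and this is not the fix that was eventually applied here.

    # Run on the replica (clouddb1002 in this case); all GTID values are placeholders.
    sudo mysql -e "STOP SLAVE;"
    sudo mysql -e "SELECT @@gtid_slave_pos;"
    # e.g. '0-171970580-1000,1-171970580-500' -> bump only the stuck domain (0 here)
    sudo mysql -e "SET GLOBAL gtid_slave_pos = '0-171970580-1001,1-171970580-500';"
    sudo mysql -e "START SLAVE;"
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Running|Seconds_Behind'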
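The workaround mentioned at 21:00 (excluding the table via Replicate_Wild_Ignore_Table) can be sketched as below. The schema and table names are invented for illustration; on MariaDB the replication filter variables can be set at runtime, but the slave threads need to be stopped first, and the filter stays in effect only until it is removed again (or made permanent in my.cnf).

    # 's12345__example.problem_table' is a placeholder for the real ToolsDB table.
    sudo mysql -e "STOP SLAVE;"
    sudo mysql -e "SET GLOBAL replicate_wild_ignore_table = 's12345__example.problem_table';"
    sudo mysql -e "START SLAVE;"
    # confirm the filter is active and replication is catching up
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Wild_Ignore_Table|Seconds_Behind'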