[01:05:44] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:07:58] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[09:11:32] (MysqlReplicationLag) firing: MySQL instance db1134:9104 has too large replication lag (2h 9m 2s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1134&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:11:43] ^ me
[09:26:32] (MysqlReplicationLag) resolved: MySQL instance db1134:9104 has too large replication lag (11m 58s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=db1134&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:37:40] db1176 is now running MariaDB 11 :)
[09:52:07] Amir1: https://phabricator.wikimedia.org/T326211#8500968
[10:21:46] oh nice
[13:16:04] https://phabricator.wikimedia.org/T326309 is thumbnails on arywiki broken "since forever", suggesting (they think) a LocalSettings.php config error regarding swift setup. But I thought our swift credentials were in PrivateSettings.php (and the same everywhere)...?
[13:18:00] Emperor: correct, we use the same swift credentials for all the wikis
[13:19:23] the linked https://www.mediawiki.org/wiki/Topic:Xa48l2j22xg953la suggests the problem is (only?) with images uploaded directly to arywiki
[13:20:32] [but I don't know how LocalSettings.php is set up on the various wikis, nor where one might go looking to see what is wrong on ary]
[13:32:57] Right, I've been digging and found /srv/mediawiki-staging/wmf-config/
[13:33:22] InitialiseSettings.php doesn't have any arywiki config that might affect swift
[13:35:13] likewise CommonSettings.php
[13:36:54] are the swift ACLs correct?
[13:41:34] marostegui: objections to running a maint script that increases the size of s3 by maybe 50GB-100GB? Context T312666 (filling new fields before we can start removing old ones)
[13:41:34] T312666: Remove duplication in externallinks table - https://phabricator.wikimedia.org/T312666
[13:41:48] Amir1: we should have plenty of space in s3
[13:41:50] let me check
[13:41:56] Emperor: hmm. arywiki doesn't have an EDP, so I'm not sure why it even has local uploads enabled
[13:42:10] Amir1: yeah, 100GB shouldn't be a big deal
[13:42:18] awesome
[13:44:38] taavi: sorry, not sure what you mean by EDP (but maybe the problem is that it shouldn't have local uploads enabled at all?)
[13:46:09] https://meta.wikimedia.org/wiki/Non-free_content#Exemption_Doctrine_Policy
[13:49:58] Emperor: basically small wikis are not allowed to have local uploads anymore
[13:52:43] Amir1: seems sensible; but is it possible/easy to find out if ary _should_ have local uploads? [I'm hazarding a guess that the problem here might be that arywiki config and/or admins think it should, but something between it and swift (or maybe swift) thinks otherwise?]
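On the question at 13:36 of whether the Swift ACLs are correct: a minimal sketch of how the per-wiki containers could be inspected with the standard python-swiftclient CLI, assuming the usual wikipedia-<lang>-local-* container naming. The container names, credentials, and the ACL value shown are illustrative assumptions, not details taken from this log.

    # Swift credentials (ST_AUTH/ST_USER/ST_KEY or OS_* variables) must already be
    # set in the environment; container names below are assumed, not confirmed.
    swift stat wikipedia-ary-local-public | grep -i acl
    swift stat wikipedia-ary-local-thumb | grep -i acl
    # If a container turned out to be missing its read ACL, something like this
    # could grant anonymous read access (the '.r:*' value is purely an example):
    # swift post --read-acl '.r:*' wikipedia-ary-local-thumb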
[13:53:36] yeah, I need to check that
[13:55:27] would appreciate it, I don't really feel like I know where to go chasing here
[14:21:51] I think I created that wiki and I remember it was very uncooperative during its creation, so I might have messed up something
[14:26:38] Amir1: I think WikimediaMaintenance has a script to fix the swift access rules, do you think that could help?
[14:35:25] (the right answer sounds like it might be to block local uploads rather than granting swift access, though?)
[14:37:36] let me dig up the ticket for its creation
[14:37:45] maybe that says if it should have been allowed or not
[14:38:44] > Local file uploads: yes
[14:38:45] ugh
[14:41:02] sad times
[14:47:59] Emperor: fixed ^_^
[14:48:28] Amir1: thank you <3
[14:49:50] Is it OK to use my staff account to comment on https://www.mediawiki.org/wiki/Topic:Xa48l2j22xg953la to say it's fixed?
[14:50:13] sure thing
[14:52:25] * Emperor has mostly internalised "don't edit anything with your staff account" which is obviously Not Right, but probably a sensible default ;-)
[14:52:47] (and this is obviously a "wearing a jaunty WMF staff hat" moment)
[15:15:58] mostly it's "don't edit Wikipedia et al content pages"
[19:04:51] o/ I have a question for the mariadb experts in the room :) I have a toolsdb replica (clouddb1002) stuck on a single transaction for days, I want to skip that transaction but I'm not sure of the Best (tm) way. more details at T326261
[19:04:52] T326261: [toolsdb] clouddb1002 stopped replicating from clouddb1001 (again) - https://phabricator.wikimedia.org/T326261
[19:05:33] I discussed this earlier with maros.tegui but I thought I'd write here instead of pinging him directly as it's quite late in the EU
[19:06:35] yeah, I'll check once I figure out how to do this one thing
[19:06:43] never did this before, I see there's sql_slave_skip_counter, which would probably fail because I have multiple domain_ids in gtid_slave_pos. would adding 1 to gtid_slave_pos work?
[19:06:45] s_t is discouraged
[19:06:54] thanks Amir1
[19:09:53] it's a bit of an unusual case, the replica is already inconsistent and I just want it to resume replication for a little bit longer, until we replace it with a proper one
[19:15:48] dhinus: my suggestion is to stop replication, check what's making the transaction quite slow, make it fast (e.g. truncate the table) and start replication back again
[19:16:00] if that falls under acceptable loss
[19:18:45] the transaction is (as far as I understand) deleting millions of lines from a table with no primary key
[19:19:20] so I can't really think of a way to make it fast... and I'm ok with those lines remaining there, as a temporary measure
[19:19:25] maybe you have to just wait for it to finish :/ once done, it'll be fast
[19:19:39] it's been stuck for 10 days :D
[19:19:52] okay 10 is a lot
[19:20:12] how big is the table file?
[19:20:19] lemme check
[19:21:21] 664M
[19:23:11] that's not too bad. copy the file to somewhere else, stop replication, truncate the table, start replication, wait for that transaction to blow over, stop replication, copy the ibd file back
[19:23:27] didn't test it though :D
[19:23:57] heh, I can try, but what happens after the file is back?
[19:24:36] won't replication crash again?
[19:25:00] I'm not following
[19:25:14] if the table is empty, it won't delete anything
[19:26:15] but after I copy the original ibd file back, will it still be empty?
[19:27:01] I don't know enough about the mariadb internals really, I can give it a go.
[19:27:35] ahhh you mean copying the file from the primary
[19:27:40] it won't be empty but there won't be any replication delete action to do
[19:27:43] right?
[19:31:47] so there are multiple steps. on the first step, after I truncate the table, the problematic transaction might not even crash because it's doing DELETE WHERE, and it will just delete 0 lines, I think?
[19:33:02] yup :D
[19:33:20] it's horrible but I don't have a better idea
[19:33:30] these tables should move to somewhere decent
[19:33:34] then I stop replication again, and replace the ibd file in the replica with a copy from the primary
[19:33:34] and refactored
[19:33:55] don't copy from primary
[19:34:38] why not? the file from the replica will have all the extra million lines, no?
[19:34:56] you can do either of these: 1- copy from the backup file you made, but you have to stop replication right after you started it (and the problematic transaction is done)
[19:34:58] or
[19:35:40] you have to wait days for the replication to fully catch up (I assume there will be more changes after that transaction), then stop replication and then copy the file
[19:35:58] yes I hope it won't be days but it will def be hours
[19:36:20] with the flow of writes I've seen, I think it'll take a couple of days :D
[19:36:32] yeah likely :)
[19:36:48] in wmf prod it's 1:3 or 1:5 if the section is quiet, but in toolsdb it's 1:1.5 even
[19:36:50] so you would back up the ibd on the replica anyway
[19:37:46] oh definitely
[19:38:15] my worry is that right now the replica file contains 4 million rows, and ideally I'd like to end up with only 1 million (as in the primary)
[19:38:42] if I copy back the current replica file, won't I get back to 4 million rows in the replica?
[19:40:00] yup, you would, so the second way of doing it seems better
[19:40:21] I didn't know how massive the deletion was
[19:41:52] second way = wait for replication to fully catch up?
[19:43:09] yup
[20:58:05] I'm not really here, but copying and moving ibd files just like that won't work and will corrupt the innodb tablespace entirely
[20:58:25] I can talk to you on Monday dhinus about how to skip that transaction
[20:58:29] as I'm off tomorrow
[20:58:42] if you want to back up the table, do a mysqldump
[20:58:51] but don't copy the .ibd file
[20:59:57] thanks marostegui, we continued the conversation with Amir1 in private. we found another way that fixed the immediate problem...
[21:00:13] (we just added the table to the exclusion list Replicate_Wild_Ignore_Table)
[21:00:43] I'm writing a report in phab of what I did :)
[21:01:03] I'm also off tomorrow, but I will keep an eye on IRC
[21:03:40] thanks Manuel, sorry about that part!
[21:06:42] https://phabricator.wikimedia.org/T326261#8503035
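On dhinus's question at 19:06 about skipping the stuck transaction: with MariaDB GTID replication, the commonly described approach is to advance gtid_slave_pos past the offending event rather than rely on sql_slave_skip_counter. A minimal sketch under those assumptions follows; the GTID values are placeholders, only the sequence number of the affected domain is bumped, and this is not the fix that was eventually applied here.

    # Run on the replica (clouddb1002 in this case); all GTID values are placeholders.
    sudo mysql -e "STOP SLAVE;"
    sudo mysql -e "SELECT @@gtid_slave_pos;"
    # e.g. '0-171970580-1000,1-171970580-500' -> bump only the stuck domain (0 here)
    sudo mysql -e "SET GLOBAL gtid_slave_pos = '0-171970580-1001,1-171970580-500';"
    sudo mysql -e "START SLAVE;"
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Running|Seconds_Behind'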
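The workaround mentioned at 21:00 (excluding the table via Replicate_Wild_Ignore_Table) can be sketched as below. The schema and table names are invented for illustration; on MariaDB the replication filter variables can be set at runtime, but the slave threads need to be stopped first, and the filter stays in effect only until it is removed again (or made permanent in my.cnf).

    # 's12345__example.problem_table' is a placeholder for the real ToolsDB table.
    sudo mysql -e "STOP SLAVE;"
    sudo mysql -e "SET GLOBAL replicate_wild_ignore_table = 's12345__example.problem_table';"
    sudo mysql -e "START SLAVE;"
    # confirm the filter is active and replication is catching up
    sudo mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Wild_Ignore_Table|Seconds_Behind'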