[01:03:28] PROBLEM - MariaDB sustained replica lag on s7 on db1170 is CRITICAL: 66.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1170&var-port=13317
[05:21:54] RECOVERY - MariaDB sustained replica lag on s7 on db1170 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1170&var-port=13317
[09:02:47] that wasn't backups, backups happen on db1171
[09:06:40] that's vslow/dumps
[09:07:24] although this is not normal on the master https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104&refresh=1m&from=now-12h&to=now&var-job=All&viewPanel=2
[09:07:45] ha 00:42 zabe: running 'zabe@mwmaint2002:~$ mwscript namespaceDupes.php --wiki=viwiki --fix' in screen
[09:08:01] viwiki lives in s7, and matches the massive increase in writes
[09:08:17] zabe: any ETA on when that script will be finished?
[09:08:17] should I kill that?
[09:09:06] zabe: It is fine at the moment, but is it throttled?
[09:09:06] not really, but I don't think it should take longer than a day
[09:10:19] it has a batch size of 500 and then runs this->waitForReplication
[09:10:38] sadly that batchsize is hardcoded and I can't easily reduce it: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/maintenance/namespaceDupes.php#369
[09:10:40] I wonder why it didn't get that db1170 was lagging
[09:10:50] Maybe it only checks hosts in the same dc as the master?
[09:12:00] zabe: To be honest, the increase is quite massive, it has basically 3x the UPDATEs on the master
[09:12:48] sure, I can kill it if you prefer and write something to make it less heavy
[09:12:56] zabe: yeah, let's play it safe
[09:12:58] Sorry!
[09:15:05] ok, done :)
[09:15:39] thank you!
[09:16:59] Back to normal values https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1181&var-port=9104&refresh=1m&from=now-3h&to=now&var-job=All&viewPanel=2
[09:18:33] so the script had like ~500 updates per second. what would be a fair thing? would 100 per second be ok?
[09:23:11] It is hard to tell, but we can start with that yeah
[09:23:31] ok
[09:37:03] PROBLEM - MariaDB sustained replica lag on s4 on db2137 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13314
[09:38:05] RECOVERY - MariaDB sustained replica lag on s4 on db2137 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2137&var-port=13314
[11:05:41] Ah, upload failure tickets, how I love thee, let me count the ways...
[11:05:44] ...that didn't take long
[11:13:11] I am about to switch s6 eqiad master
[11:39:03] btullis arnaudb can you guys check this? https://phabricator.wikimedia.org/T355660#9480535
[11:39:12] If you want me to create a new task I can do so
[11:39:54] I am a bit worried that we have all the scripts scattered between cumin1001, cumin1002, etc
[11:40:05] So let's try to give that some priority to get all fixed
[11:40:14] Yep, I'll check it out.
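(On the throttling discussed at 09:10–09:23 above: the rough shape of a write-rate cap for a batched maintenance job, sticking to the numbers mentioned in the conversation, 500-row batches and a target of about 100 updates per second. This is an illustrative Python sketch only, not the actual namespaceDupes.php, which is PHP and waits on replication via waitForReplication after each batch; the apply_batch helper is hypothetical.)

    import time

    def run_throttled(apply_batch, batch_size=500, max_rows_per_sec=100):
        """Apply batches until there is no more work, sleeping between batches
        so the sustained write rate stays at or below max_rows_per_sec.

        apply_batch(n) is a caller-supplied (hypothetical) helper that performs
        up to n row updates and returns how many rows it actually changed."""
        total = 0
        while True:
            started = time.monotonic()
            changed = apply_batch(batch_size)
            if changed == 0:
                break
            total += changed
            # A batch of `changed` rows should occupy at least this much
            # wall-clock time to keep the average rate under the cap.
            min_duration = changed / max_rows_per_sec
            elapsed = time.monotonic() - started
            if elapsed < min_duration:
                time.sleep(min_duration - elapsed)
        return total

(MediaWiki's own mechanism, as noted above, is to wait for replication after each batch rather than enforce a fixed rate; the fixed cap here only illustrates the ~100 updates/second figure agreed on in the chat.)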
[11:40:39] btullis: It looks like it fails to connect to dbstore1009:3316
[11:41:07] This is what should work: db-move-replica dbstore1009:3316 db1231
[11:41:24] root@cumin1001:/home/marostegui# db-move-replica dbstore1009:3316 db1231
[11:41:24] Are you sure you want to move instance dbstore1009:3316 to replicate directly from db1231 [yes/no]? yes
[11:41:24] [ERROR]: The move operation failed: The host is not configured as a replica
[11:41:48] OK, I've never had to run that script before. I will check with arnaudb as well.
[11:42:29] Yeah, that's what db-switchover runs under the hood, and db-switchover failed too, that's why I tested it on its own
[11:44:54] btullis: It is probably safer to test with db-compare (which also fails): db-compare frwiki user user_id dbstore1009.eqiad.wmnet:3316 db1173.eqiad.wmnet
[11:45:15] That doesn't work either, and it should (like it works with: db-compare frwiki user user_id db1231.eqiad.wmnet db1173.eqiad.wmnet)
[11:45:41] Let's follow up here: https://phabricator.wikimedia.org/T355531
[11:45:53] Ack, thanks.
[12:24:04] ack, will do :) thanks
[12:38:33] inode percentage went from 0.0219% to 0.0211% in s3 xD
[13:58:25] Amir1: you can deploy your schema change on the old s6 master, db1173
[13:58:29] let me know when done so I can reimage
[13:58:37] awesome
[13:58:38] thanks
[13:58:40] keep in mind it has one slave
[13:58:45] so do it without replication
[13:59:05] noted, thanks
[14:08:26] marostegui: schema change over, depooled
[14:08:34] great thanks!
[14:25:28] https://www.irccloud.com/pastebin/qGTWCdDN/
[14:25:39] why pagelinks is so damn large everywhere :((
[14:25:46] thankfully it'll be better soon
[15:13:15] hey folks sorry to bug you
[15:13:35] there are 10 db hosts in codfw rack b5 whose network links we hope to move on Thursday
[15:13:50] See T355549 and the google sheet below
[15:13:51] T355549: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549
[15:13:51] https://docs.google.com/spreadsheets/d/1PlGGLclKFYR9XaqjOLibhiwwny0fOD8gLMwsNhIzGRo/edit#gid=2011030091
[15:14:09] also pc2012 in that rack
[15:14:34] the work is just to move each server's network link from the old switch to the new switch as part of the hardware upgrade cycle
[15:14:56] less than 60 seconds outage per host, with only 1 being down at any given time
[15:15:28] from the db perspective is there a way we can proceed here? can we depool the hosts in question etc. before we commence the work?
[15:15:45] topranks: We are lucky cause we only have two masters (pc2012 and db2107) so we can probably switch those before Thursday, but next time we definitely need more heads up than just 2 days
[15:17:35] marostegui: I'm really sorry about that, it's my fault we had to bring it forward due to extended leave of some SREs that I only became aware of late last week
[15:17:48] normally we would always try to give folks a longer heads up
[15:17:59] Yeah, no worries. I will get that ready for you
[15:18:01] if there is any concern or pressure here we can also postpone
[15:18:20] we are going to do it Thursday if that is agreeable, I understand the timeframe isn't ideal
[15:18:21] it is okay, I will take care of all that
[15:18:24] thanks
[15:20:23] marostegui: I will check in with you Thursday before we start to make sure we are ok to proceed if that's ok?
[15:20:33] sounds good
[15:21:06] ok, and hopefully I can get you a beer in Warsaw to say thanks :)
[15:21:30] haha
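(On the db-move-replica failure at 11:41 above, "The host is not configured as a replica" against dbstore1009:3316: a minimal sketch of the kind of check such a tool needs to make, i.e. reach the instance on its non-default port and see a replication configuration there. This is not the actual db-move-replica implementation; pymysql, the credentials, and the helper name are illustrative assumptions.)

    import pymysql
    import pymysql.cursors

    def is_configured_replica(host, port, user="check", password=""):
        """Connect to a MariaDB instance on the given port (multi-instance hosts
        like dbstore1009 listen on non-default ports) and report whether it has
        replication configured. User/password are placeholders, not production
        credentials."""
        conn = pymysql.connect(host=host, port=port, user=user, password=password,
                               connect_timeout=5,
                               cursorclass=pymysql.cursors.DictCursor)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
                # No row at all means no replication is configured on this
                # instance, which is what the error message above complains about.
                return row is not None and bool(row.get("Master_Host"))
        finally:
            conn.close()

    # e.g. is_configured_replica("dbstore1009.eqiad.wmnet", 3316)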