[01:08:34] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 13.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:08:56] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 11.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:36] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:54] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[08:46:19] PROBLEM - MariaDB sustained replica lag on s5 on db1130 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1130&var-port=9104
[08:51:09] RECOVERY - MariaDB sustained replica lag on s5 on db1130 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1130&var-port=9104
[09:09:05] o/ today at 17:00 UTC we are finally migrating ToolsDB to a new host. I've added the expected procedure in T333471, if you have any comments/suggestions please let me know here or in the phab.
[09:09:05] T333471: Move all tools from clouddb1001 to tools-db-1 - https://phabricator.wikimedia.org/T333471
[09:16:25] is anyone from the data persistence team around at the migration time? the migration will be done by myself and andrewbogott, but we could certainly do with an extra pair of eyes. :)
[09:17:27] * Emperor is the wrong flavour of d-p person, but in any case will be gone by then today
[10:30:50] dhinus: I think we are all out today and tomorrow (public holiday)
[13:13:49] dhinus: I'm actually around but my low-level db knowledge is ... something to be desired
[13:14:42] so can help (drop me a calendar invite) but can't promise much
[13:15:39] Amir1: thanks, I'll share a calendar invite :)
[13:33:22] Amir1: would you mind doing a quick review of the steps I listed in T333471 and let me know if anything stands out?
[13:33:23] T333471: Move all tools from clouddb1001 to tools-db-1 - https://phabricator.wikimedia.org/T333471
[13:37:51] sure, give me a bit
[13:55:54] dhinus: do you use gtid or pt-heartbeat?
[13:56:01] gtid
[13:56:15] that's the one I don't know much about :(
[13:56:29] make sure you set it up correctly
[13:56:31] haha and I know nothing about pt-heartbeat :)
[13:56:58] but most importantly make sure the old master never becomes rw to avoid split brain
[13:57:40] e.g. SET GLOBAL read_only = 0; this shouldn't replicate
[13:58:30] I don't know if it gets written to binlog and replicates but better safe than sorry, do "set session sql_log_bin=0;" before doing that
[13:58:54] besides that, I don't have anything major tbh
[13:59:11] (our switchovers are mostly automated)
[13:59:43] yeh read_only=0 should work, and that's also the default if the service is restarted
[14:01:01] I'm pretty sure it wouldn't replicate but setting also sql_log_bin=0 before that is a good idea
[14:01:11] read_only=0 means that the server IS writable
[14:01:24] so the old master should need read_only=1
[14:02:25] yeah and oh btw, you must wait and make sure the replica has caught up before turning it into master, otherwise you end up with missing stuff
[14:11:02] PROBLEM - MariaDB sustained replica lag on s1 on db1118 is CRITICAL: 6.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1118&var-port=9104
[14:13:16] PROBLEM - MariaDB sustained replica lag on s1 on db1154 is CRITICAL: 8.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13311
[14:24:22] eqiad is not catching up
[14:24:25] damn it
[14:25:36] PROBLEM - MariaDB sustained replica lag on s1 on db1134 is CRITICAL: 8.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1134&var-port=9104
[14:26:53] RECOVERY - MariaDB sustained replica lag on s1 on db1134 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1134&var-port=9104
[14:30:06] urandom: I'd be late, dealing with an incident
[14:32:30] fun times
[14:40:35] PROBLEM - MariaDB sustained replica lag on s1 on db1119 is CRITICAL: 40.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1119&var-port=9104
[14:41:37] PROBLEM - MariaDB sustained replica lag on s1 on db1128 is CRITICAL: 20.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1128&var-port=9104
[14:42:43] RECOVERY - MariaDB sustained replica lag on s1 on db1128 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1128&var-port=9104
[14:43:55] RECOVERY - MariaDB sustained replica lag on s1 on db1119 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1119&var-port=9104
[14:45:48] zabe: I killed your s1 job
[14:46:43] PROBLEM - MariaDB sustained replica lag on s1 on db1107 is CRITICAL: 52.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1107&var-port=9104
[14:50:39] RECOVERY - MariaDB sustained replica lag on s1 on db1107 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1107&var-port=9104
[14:52:05] RECOVERY - MariaDB sustained replica lag on s1 on db1154 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1154&var-port=13311
[14:53:03] RECOVERY - MariaDB sustained replica lag on s1 on db1118 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1118&var-port=9104
[14:54:24] ack
[16:33:04] dhinus: need a minute, will be there soon
[16:35:05] no worries, we're in the google meet and will only start actually doing things in 25 mins
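
The switchover advice given between 13:56 and 14:02 above (disable binary logging for the session, set the old master read-only, and promote the replica only once it has caught up) can be sketched roughly as below. This is a minimal illustration and not the procedure from T333471. The function names `demotion_statements` and `replica_caught_up` are hypothetical, the status dictionary is assumed to hold a row from `SHOW SLAVE STATUS`, and comparing GTID positions as plain sets of strings is a simplification that assumes a simple one-primary topology.

```python
from __future__ import annotations

from typing import Mapping


def demotion_statements() -> list[str]:
    """SQL to run, in this order and in ONE session, on the OLD master.

    Hypothetical helper: it only returns the statements, it does not
    connect to any server.
    """
    return [
        # Keep this session's statements out of the binary log, so
        # nothing done here can reach the replica.
        "SET SESSION sql_log_bin = 0;",
        # read_only = 1 makes the server read-only for normal users;
        # read_only = 0 would make it writable.
        "SET GLOBAL read_only = 1;",
    ]


def _gtid_set(pos: str) -> frozenset[str]:
    # A GTID position is a comma-separated list like "0-1-100,1-2-5".
    return frozenset(p.strip() for p in pos.split(",") if p.strip())


def replica_caught_up(
    status: Mapping[str, object],
    primary_gtid_binlog_pos: str,
    replica_gtid_slave_pos: str,
) -> bool:
    """Return True only if the replica looks fully caught up.

    `status` is assumed to be one row of SHOW SLAVE STATUS on the
    replica, with Seconds_Behind_Master given as an int or None.
    The two position strings are assumed to come from
    @@gtid_binlog_pos (old master) and @@gtid_slave_pos (replica),
    read AFTER the old master was made read-only.
    """
    threads_ok = (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
    )
    no_lag = status.get("Seconds_Behind_Master") == 0
    primary = _gtid_set(primary_gtid_binlog_pos)
    replica = _gtid_set(replica_gtid_slave_pos)
    same_position = bool(primary) and primary == replica
    return threads_ok and no_lag and same_position
```

The check is deliberately strict: a stopped replication thread, a missing lag value, or an empty GTID position all count as "not caught up", since promoting too early is what leads to missing data.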