[01:08:25] PROBLEM - MariaDB sustained replica lag on m1 on db1217 is CRITICAL: 7.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:08:35] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 14.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:17] RECOVERY - MariaDB sustained replica lag on m1 on db1217 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1217&var-port=13321
[01:12:53] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[03:52:39] jynus: the Phabricator migration will be happening today at 15:00 UTC, in case you want to run some last-minute backups
[05:34:51] https://grafana.wikimedia.org/d/000000377/host-overview?viewPanel=28&orgId=1&var-server=db2109&var-datasource=thanos&var-cluster=wmcs&from=1692758235082&to=1692768874368
[09:39:17] should I stop replication myself?
[09:39:21] Re: phabricator
[11:27:11] jynus: No, I will do it
[11:27:14] But it is at 15:00 UTC
[11:40:22] no worries
[11:47:36] what I mean is that I will stop it on the "production" host
[11:47:52] the one we have for the quick failover
[11:51:17] yes, leaving that to you. I meant I had planned for the backup already
[11:51:36] ah cool
[11:51:46] (and tested, but waiting a bit to run it)
[11:53:49] There is one minor thing: because dbprov1004 hasn't been upgraded to 10.6 yet, preparation will have to happen on the host, it cannot be done beforehand (so xtrabackup /srv/sqldata --prepare before starting MariaDB on the host)
[15:05:16] marostegui: I have a screen with "✔ root@cumin1001:~$ # Do not run unless emergency, will break data # transfer.py --type=decompress dbprov1004.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.m3.2023-08-23--13-34-58.tar.gz db1164.eqiad.wmnet:/srv/sqldata.s3" pending, in case someone else has to run it in an emergency
[15:05:32] oh nice :)
[15:05:52] hopefully if that moment arrives, I can simply promote db1119 to master
[15:05:53] that way one doesn't have to remember the options
[15:06:02] yes ofc
[15:14:52] I think I am going to leave replication stopped in the secondary DC
[15:14:56] Until tomorrow
[15:15:04] It wouldn't hurt
[15:15:27] And on the "hot backup" host too
[15:15:41] But I will restart it on the backup source (db1217) once I get the ok from brennen
[15:16:10] I will remove my screen, though, to prevent accidents
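A minimal sketch of the on-host prepare step mentioned at 11:53:49, assuming the snapshot has already been decompressed into /srv/sqldata on the destination host; the exact backup tool (xtrabackup vs. mariabackup), paths, ownership, and service name are assumptions, not the confirmed procedure:

    # Hypothetical: apply the redo log so the copied datadir is consistent.
    # This is the step that cannot be done beforehand on dbprov1004 (pre-10.6).
    xtrabackup --prepare --target-dir=/srv/sqldata
    # Hypothetical: fix ownership before the service touches the files.
    chown -R mysql:mysql /srv/sqldata
    # Start MariaDB only after the prepare step has completed successfully.
    systemctl start mariadb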
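For reference, pausing and later resuming replication on the secondary-DC and "hot backup" replicas (as discussed from 15:14:52 onwards) comes down to the standard MariaDB statements; this is a generic sketch, not the exact commands or tooling used on these hosts:

    # Generic sketch, run on the replica to pause replication during the migration.
    mysql -e "STOP SLAVE;"
    # Check that both replication threads report "No" and no error is shown.
    mysql -e "SHOW SLAVE STATUS\G"
    # Later, once the migration is confirmed good (e.g. the ok from brennen),
    # resume replication, for example on the backup source (db1217).
    mysql -e "START SLAVE;"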
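The fallback mentioned at 15:05:52, promoting db1119 to master, would in the simplest case look roughly like the outline below; this is purely illustrative, using only standard MariaDB statements, and does not reflect the site-specific switchover tooling that would normally be used:

    # Purely illustrative outline of promoting a replica to master in an emergency.
    mysql -e "STOP SLAVE;"                 # stop applying events from the old master
    mysql -e "SHOW SLAVE STATUS\G"         # confirm everything received has been applied
    mysql -e "RESET SLAVE ALL;"            # drop the old replication configuration
    mysql -e "SET GLOBAL read_only = 0;"   # allow writes on the new master
    # Remaining replicas would then be re-pointed with CHANGE MASTER TO, and
    # proxy/application configuration updated to the new master.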