[01:08:16] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 11.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:09:18] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 11.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:10:14] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 15.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:54] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:28] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:11:50] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[06:13:02] I just switched over x2 codfw
[07:33:28] PROBLEM - MariaDB sustained replica lag on s4 on db1121 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[07:34:36] RECOVERY - MariaDB sustained replica lag on s4 on db1121 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1121&var-port=9104
[08:33:56] Reminder that Thursday at 09:00 AM UTC is when DB maintenance finishes in preparation for the DC switchover
[08:34:05] So no more maintenance after that
[09:42:12] backup2002 must have had a connectivity issue, because both backups failed 2 minutes after starting at 0h :-(
[09:48:53] or maybe it is worse and there is an IO problem, checking
[11:17:21] I am investigating why db1219 (new host) is catching up so slowly
[11:25:35] jynus: db1225 isn't showing up in grafana?
[11:25:43] or db1216
[11:25:49] it used to
[11:26:00] Ah, never mind
[11:26:07] I was looking at a timeline where it wasn't present
[11:26:11] it is there if I look at it now
[11:26:39] ah, that caught me in the past, yeah
[11:27:14] apparently the spare::system is no longer a thing even though it is still on the template
[11:28:06] assuming I have permissions, can you show me how to purge a host from orchestrator, so I don't have to ask you every time?
[11:28:31] sure, you can just go to the UI
[11:28:48] click on the host (each instance)
[11:28:56] and hit "forget"
[11:32:00] that is one per mysql instance?
[11:35:00] yes
[11:38:55] I think I've done it
[11:39:47] yep
[11:39:49] it is gone
[11:40:15] in theory I updated zarcillo, too, but it may take some time to take effect
[11:41:43] there are some hosts that may require an update/review, db1101 maybe?
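An aside on the "forget" workflow discussed at 11:28-11:35 above: the same per-instance removal can also be scripted instead of clicked through in the UI. The sketch below is hypothetical and not taken from this log; it assumes the upstream orchestrator-client CLI is available wherever orchestrator is reachable, and the hostname and port are placeholders only.

    # Hypothetical sketch: "forget" each MySQL instance of a host in orchestrator
    # from the command line instead of the web UI (one call per instance, matching
    # the per-instance clicks described above). Hostname/port are placeholders.
    for instance in db1101:3306; do
        orchestrator-client -c forget -i "${instance}"
    done

Note that forgetting an instance only drops it from orchestrator's inventory; it does not touch the database itself, and a still-running, still-replicating instance may simply be rediscovered later.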
[11:42:21] (not a priority, can be done after the full batch is processed)
[11:42:37] https://grafana.wikimedia.org/goto/kWkkX5E4k?orgId=1
[11:43:04] I deleted some of those already, I think
[11:43:24] then it may be the cron
[11:43:47] db2144 is an interesting one, as that one was rebooted today, I think
[11:44:16] we will have a look another time, with fewer things on the fly; it won't be an issue, not even for metrics
[11:45:13] db2144 just got cleaned
[11:45:20] Which is expected
[11:45:22] so all good
[11:45:30] db1219 should also fix itself, as I just rebooted it
[11:45:46] yeah, there are delays on the cron + prometheus query window
[11:46:02] that's why it is better to research when fewer things are happening
[14:35:44] o/ I would appreciate a quick review of T334947 :)
[14:35:46] T334947: ToolsDB: discard obsolete GTID domains - https://phabricator.wikimedia.org/T334947
[14:42:58] 👀
[15:35:47] dhinus: my advice is don't go there if you don't really need to
[15:37:04] marostegui: I don't really _need_ to, but GTIDs are becoming pretty long and it would be nice to clean them up. the docs make it look easy, but maybe it isn't? :)
[15:37:11] dhinus: https://jira.mariadb.org/browse/MDEV-30386
[15:37:23] I filed that a few weeks ago and am still working with MariaDB on it
[15:37:36] haha, thanks :)
[15:37:48] I guess I will at least wait to see how that ticket evolves
[15:38:00] it is blocking us in production
[15:39:43] at the same time, is it worth trying, in case I'm lucky and don't get that error?
[15:42:45] my understanding is that you are not being told "don't do it", just that there are a lot of issues with it and, in our experience, you may only waste time. However, it is way more feasible on ToolsDB than in production, because you may be able to stop all db services at the same time (just my personal opinion)
[15:44:40] thanks both for your feedback and for the link to the jira issue!
[15:44:46] I would also not touch gtid and would give up on it - Amir1 can testify, as the other day (while Manuel was out) I gave him a retrospective :-D
[15:46:04] the funny thing is, when MariaDB first announced it publicly and widely, I told Monty that it looked like a terrible idea, but he didn't want to listen
[15:46:20] dhinus: if the only benefit you are after is just making it look nicer and there's no technical reason behind it (we do have one in production), I think you may be wasting your time or potentially getting yourself into dangerous situations
[15:48:34] dhinus: and all this comes from an old bug we also filed, which triggered all the domain_id deletion stuff: https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-12012
[15:48:57] with "give up" I don't mean literally remove it - more like "ignore it as much as possible"
[15:51:07] yes, it's only a cosmetic thing. I think I will postpone it for now and link the two jira issues in my phab task
[15:51:50] you can also make that task a subtask of this:
[15:52:17] https://phabricator.wikimedia.org/T324965
[15:52:34] sounds good, yeah! I didn't see that task before :)
[15:53:08] and wow, I thought my gtid was long, but that one is MUCH longer :D
[15:54:02] and equally useless!
[16:06:25] done, moved as a subtask of that one! thanks again for your help, always much appreciated :)
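For context on the cleanup being weighed in the 14:35-16:06 discussion (T334947), here is a minimal sketch of the documented MariaDB approach to discarding obsolete GTID domains. It assumes replication is stopped and nothing is still writing with the old domain IDs, the domain IDs shown are placeholders rather than real ToolsDB values, and it is not a recommendation here given the problems tracked in MDEV-30386 and MDEV-12012.

    # Hypothetical sketch only; domain IDs (11, 12) are placeholders.
    # 1. Inspect the current GTID state to see which domains still appear.
    mysql -e "SELECT @@gtid_binlog_pos, @@gtid_slave_pos\G"
    # 2. With replication stopped and no writes using the obsolete domains,
    #    drop them from the binlog state (documented MariaDB syntax).
    mysql -e "FLUSH BINARY LOGS DELETE_DOMAIN_ID = (11, 12)"

FLUSH BINARY LOGS ... DELETE_DOMAIN_ID is the documented way to remove a domain from the binlog GTID state; whether it behaves safely on a live, multi-writer setup is exactly what the linked MariaDB issues call into question, which is why the advice above is to leave it alone unless there is a real need.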