[01:09:49] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 19.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:09:55] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 17.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:39] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:43] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[09:46:25] jynus: could you review https://gerrit.wikimedia.org/r/c/operations/puppet/+/888359 for the upcoming m1 switch?
[09:46:50] let me see
[09:46:54] thanks
[09:47:15] (I was preparing the downtime on icinga)
[09:49:45] yeah no problem!
[09:50:57] the comment here is confusing, after deploy, shouldn't we remove "future"?
[09:51:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/888359/4/manifests/site.pp#721
[09:51:24] something like "m1 eqiad master"
[09:51:35] mmm gerrit down?
[09:51:50] works now
[09:51:58] ah yes, fixing
[09:52:56] sent new patch
[09:55:00] one difference I can see between hosts is that db1176 contained the ops database/query killer
[09:55:09] I think that is not needed for m1
[09:55:11] yeah, but that's a mistake
[10:15:11] I'm stopping bacula now and disabling puppet there
[10:15:14] it is idle
[10:15:18] good
[10:17:13] I've also downtimed the icinga checks
[10:22:14] I just saw some differences in grants between servers
[10:22:51] diff <(pt-show-grants h=db1176.eqiad.wmnet) <(pt-show-grants h=db1164.eqiad.wmnet)
[10:23:50] my guess is the user for racktables has to be dropped and the dump users, if any left, have to be removed without session binlog enabled
[10:24:06] Yeah, those two need to go away
[10:25:06] not a blocker
[11:17:33] only waiting for the prometheus job to confirm metrics are returning
[11:23:39] metrics are back
[11:33:56] jynus: is this on your radar? https://phabricator.wikimedia.org/T328408#8588962
[11:37:49] let's close that
[11:37:59] cool!
[11:38:48] backup1-* next?
[11:38:50] jynus: which one do you want to go for next
[11:38:51] right
[11:39:05] db1205?
[11:39:19] Let me create a task for it
[11:39:47] we can do all, but let's start with codfw, as I may run some mediabackup on eqiad soon
[11:39:53] excellent
[11:40:15] or better
[11:40:23] let's start with the replicas on both dcs
[11:40:28] ok!
[11:40:32] later the primaries
[11:40:53] db2184 and db1205
[11:41:06] https://phabricator.wikimedia.org/T329499
[11:41:24] When can I do one of them?
[11:41:30] any time
[11:41:55] I think backups will run there from 0 to 9 hours tomorrow
[11:41:59] Ah good, I am going to try to do both today and then leave that open until you confirm it is all fine
[11:42:28] then I can delete the data and reload it from backup on a non-trivial case
[11:50:47] going for lunch
[13:28:37] so the plan at T329499 is to take the backups and do a test recovery to 10.6 to be 100% sure the workflow works and is well documented
[13:28:37] T329499: Migrate backup1-* replicas to MariaDB 10.6 - https://phabricator.wikimedia.org/T329499
[13:29:21] and then we can do misc?
[13:30:52] sounds good to me yeah
[15:05:58] sobanski: let me update the invite calendar with that data
[15:06:30] jynus: Andrea is on it already
[15:06:34] ah, ok
[15:07:19] let me also check your spreadsheet to see if there are updates
[15:15:22] I added an entry based on the p*ging tool, but I have no context for it
[15:22:43] I saw that, thanks. I'll check in with the people you mentioned
[15:28:23] please also feel free to remove it- I am adding sometimes stuff if I see gaps, to try to be helpful, but we can decide they are not really incidents worth documenting, etc.