[00:04:22] PROBLEM - MariaDB sustained replica lag on es4 on es1022 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1022&var-port=9104
[00:05:22] RECOVERY - MariaDB sustained replica lag on es4 on es1022 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=es1022&var-port=9104
[07:21:23] poor db1234, its name was too perfect to type on a TKL
[07:21:27] is it me, or have we had a lot of sad database servers recently?
[07:22:07] I can't speak for "before me", but the rate has indeed increased recently!
[08:35:44] PROBLEM - MariaDB sustained replica lag on s4 on db2206 is CRITICAL: 44 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2206&var-port=9104
[08:37:46] RECOVERY - MariaDB sustained replica lag on s4 on db2206 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2206&var-port=9104
[10:49:51] morning, everyone. back at the office. please let me know if there's anything that needs following up
[10:50:00] kwakuofori: welcome back :)
[10:50:31] thanks, kormat
[12:07:39] corruption on db1156
[12:08:38] probably just needs an index rebuild; it is not a mw host, so not jumping in
[12:11:56] I'll get to it
[13:52:25] PROBLEM - MariaDB sustained replica lag on s2 on db1155 is CRITICAL: 6480 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13312
[13:55:16] ^ expected?
[13:55:49] I guess it is fallout from db1156
[13:58:06] don't know, but it's reducing
[13:58:58] and it's back in sync
[13:59:04] nice
[13:59:13] that was due to db1156
[13:59:19] I fixed it
[13:59:27] great!
[13:59:33] https://phabricator.wikimedia.org/T363161
[14:03:26] RECOVERY - MariaDB sustained replica lag on s2 on db1155 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1155&var-port=13312
[14:13:46] PROBLEM - MariaDB sustained replica lag on s7 on db2122 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2122&var-port=9104
[14:14:46] RECOVERY - MariaDB sustained replica lag on s7 on db2122 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2122&var-port=9104
[14:39:45] urandom: o/ https://gerrit.wikimedia.org/r/c/operations/puppet/+/1021915 when you have time :)
[15:07:00] elukey: oh boy, it's happening!
[15:14:17] urandom: yesss.. ok if we restart restbase-codfw now?
[15:14:27] so it picks up the truststore
[15:23:24] elukey: did you canary it somewhere?
[15:23:43] restbase always makes me (extra) nervous
[15:23:58] but in general, I'm ok to proceed, yes
[15:24:03] elukey: ^^^
[15:25:23] urandom: I didn't; if you want, I can disable puppet, run it on one node, restart the cassandras, and you can double-check
[15:25:26] then we can proceed
[15:25:28] wdyt?
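
(Editor's aside: a minimal sketch of the canary procedure elukey describes at [15:25:23]. This is not the actual change that was run; the host selection, the cassandra-a/cassandra-b unit names, and the log checks are all assumptions for illustration.)

    # Disable puppet fleet-wide for the affected role (at WMF typically via cumin),
    # so only the chosen canary host picks up the new truststore.
    sudo puppet agent --disable "canary: restbase truststore rollout"

    # On the canary host only: re-enable puppet and run it to deploy the truststore.
    sudo puppet agent --enable
    sudo puppet agent --test

    # Restart the Cassandra instances on that host so they load the new truststore.
    # Multi-instance hosts use per-instance units; the names here are assumed.
    sudo systemctl restart cassandra-a.service cassandra-b.service

    # Verify the node rejoins the ring and shows no TLS errors before proceeding.
    nodetool status
    sudo journalctl -u cassandra-a.service --since "10 minutes ago" | grep -iE 'error|ssl'

Only once the canary looks healthy would puppet be re-enabled and the remaining restbase-codfw hosts restarted.
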
[15:37:02] Last job for this section: dump.matomo.2024-04-23--04-13-04 failed! CC btullis I will check now why it failed, but FYI, in case there is known maintenance or something
[15:37:59] Error connecting to database: Access denied for user 'dump'@'10.64.16.31'
[15:39:06] ^ Amir1 I'm almost sure you haven't touched the dump grants for db1208.eqiad.wmnet, but asking so we can rule that out first, as it is the easier option
[15:39:29] most likely it will be something IP-related, or maintenance, or something like that
[15:39:37] I haven't touched dump grants
[15:40:00] at least as far as I remember
[15:40:07] thank you, as I expected
[15:40:30] then debugging the hard way :-D
[15:41:14] yeah, the user is gone. Probably it was cloned from production
[15:41:22] (the data)
[15:41:35] will ask the DE team
[15:43:03] probably T349397
[15:43:04] T349397: Migrate the matomo host to bookworm - https://phabricator.wikimedia.org/T349397
[15:43:07] will comment there
[15:46:11] I reopened https://phabricator.wikimedia.org/T349397#9736182 in case Ben reads this later, as I am 99% sure it is that issue, and it should be easy to correct.
[15:50:37] elukey: up to you; starting w/ codfw is a canary too, in a way
[15:56:16] urandom: will do it tomorrow :)
[15:56:56] :)
[15:58:14] sorry for being slow to respond, my broadband has been a bit flaky, and it seems like a few notifications have been delayed (or dropped)
[15:59:45] I just got a text apology from Google, notice of a (ridiculously small) credit, and a promise that the disruptions are over 🙂
[16:10:42] jynus: Oh, sorry. I'm totally sure that's me. I migrated matomo from matomo1002 to matomo1003 last week and I probably just dropped the grant by mistake.
[16:12:48] there was no mistake, just checking why
[16:13:05] I will add the account, retry the backups and update the ticket
[16:20:46] urandom: np! I'll wait for you tomorrow before proceeding so we can check the canary together; restbase makes me nervous as well :D
[16:20:59] but the worst case would probably be sessionstore :D
[16:21:53] nothing in restbase ever works as it ought to
[16:22:31] sessionstore I expect to be easy and carefree, but obviously there is more at stake :)
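
(Editor's aside: a hedged sketch of restoring the missing 'dump' grant discussed between [15:37:59] and [16:13:05]. The target host, the "piwik" schema name, and the privilege list are assumptions; in practice the account is managed through the usual grants workflow and the password comes from the private repo, not from here.)

    # Assumed location of the matomo database after the matomo1002 -> matomo1003 move.
    DB_HOST=matomo1003.eqiad.wmnet

    # Recreate the backup account and grant it read/dump privileges on the matomo schema.
    mysql -h "$DB_HOST" <<'SQL'
    CREATE USER IF NOT EXISTS 'dump'@'10.64.16.31' IDENTIFIED BY '<password from private repo>';
    GRANT SELECT, SHOW VIEW, TRIGGER, LOCK TABLES ON piwik.* TO 'dump'@'10.64.16.31';
    SQL

After that, the failed dump.matomo job would be retried and the connection confirmed before updating T349397.
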