[01:09:31] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 25.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:21] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[08:23:43] jynus: db2093 (db_inventory) was upgraded to 10.6 last week, did the backup run fine there?
[08:24:07] it runs tonight, sorry
[08:24:13] but I can do a test run now
[08:24:24] no no
[08:24:26] don't worry
[08:24:27] no rush
[08:50:43] https://phabricator.wikimedia.org/P43590
[08:54:14] marostegui: I am thinking of scheduling the es backup later this week, I wonder when a good time would be? As you may need to repool es servers after the maintenance (?)
[09:03:22] jynus: how long does it take?
[09:03:26] cause you maybe can run it now?
[09:03:32] And it might be finished before tomorrow's maintenance?
[09:03:42] If not, anytime after the 7th maintenance could be good
[09:03:43] Up to you :)
[09:04:21] it takes 32h+
[09:05:20] the thing is, if the window gets long, you may need extra time for repools
[09:05:53] so maybe something like 2 hours after the window may be safer
[09:15:15] yeah
[09:15:19] that sounds good
[09:36:03] this is long-overdue configurability that will make it easier to reschedule backups: https://gerrit.wikimedia.org/r/c/operations/puppet/+/886833
[09:42:41] oh that is useful indeed
[09:54:20] I wonder if we should test the 10.6 upgrade with the 2 backup replicas first, rather than the misc?
[09:54:39] yeah, so db2093 is done
[09:54:45] if the backup works fine, I will upgrade db1115
[09:54:48] that way I can put them down without affecting production
[09:54:55] then we can check with a backup source if you want
[09:54:56] yeah, it did, see the paste I did
[09:55:29] Oh sorry, missed that
[09:55:38] the idea is to do a full cycle (backup - deletion - recovery) on a non-trivial host
[09:55:40] Then I am going to upgrade db1115, which is orchestrator too :)
[09:55:55] as db_inventory is very very small
[09:56:00] yep
[09:56:45] we could also do some backup sources, there are some that are redundant
[10:09:50] this will delay es codfw backups 24 hours, just in case: https://gerrit.wikimedia.org/r/c/operations/puppet/+/886834
[10:36:47] Orchestrator is going to be unavailable for a bit
[10:40:57] orchestrator is now back
[12:32:05] marostegui: I'm sorry to bother you, but I'm having trouble finding out why these systemd service files aren't present on db1108 after a reboot.
[12:33:51] did anything change?
[12:33:59] Lots about these instances is present, but I'm expecting to find `/etc/systemd/system/mariadb@analytics_meta.service` and it's just not there. https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/mariadb/misc/analytics/backup.yaml
[12:34:21] I haven't changed anything, just rebooted.
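For reference, a minimal sketch of the kind of check implied by the question above, assuming a multi-instance host laid out like db1108 (the instance name comes from the conversation; the exact unit and socket paths depend on the puppet module):

    # Is the instantiated unit defined and loaded at all?
    ls /etc/systemd/system/mariadb@*.service
    systemctl list-units --all 'mariadb@*'
    # Each running instance also gets its own socket under /run/mysqld/
    ls -l /run/mysqld/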
[12:34:51] yeah, I don't think we have touched that instance in years
[12:34:59] At least the /srv content for both of them is there
[12:37:36] I saw nothing weird on the host, but puppet was failing earlier on some hosts, so maybe try rerunning puppet and/or reinstalling the mariadb package to see if there was some glitch or something
[12:37:45] I did a puppet run
[12:37:47] Just in case
[12:38:38] btullis: so when you did the stop mariadb@XX it was all fine?
[12:38:41] so the units were present?
[12:39:05] ii wmf-mariadb104 10.4.18-1 that's old
[12:39:41] I'm afraid that all I did was the reboot-single cookbook. I didn't think to check whether the unit files were present. I knew I'd have to start the slave on each instance manually after the reboot, but otherwise Icinga was green.
[12:39:43] ah right, this host is buster
[12:40:01] btullis: but does that stop mariadb?
[12:41:28] The matomo instance is present
[12:41:57] Oh..
[12:42:01] I started it
[12:42:34] root@db1108:/srv# systemctl start mariadb@analytics_meta
[12:42:35] root@db1108:/srv#
[12:42:46] root@db1108:/srv# mysql -S /run/mysqld/mysqld.analytics_meta.sock -e "start slave"
[12:42:46] root@db1108:/srv#
[12:43:29] It looks fine to me
[12:43:30] root@db1108:/srv# journalctl -xe -u mariadb@analytics_meta | grep -i 3352
[12:43:30] Feb 06 12:42:31 db1108 mysqld[27192]: Version: '10.4.22-MariaDB-log' socket: '/run/mysqld/mysqld.analytics_meta.sock' port: 3352 MariaDB Server
[12:43:46] Not sure what you were looking at, but the instances start ok
[12:44:12] Oh, I'm so sorry for wasting your time.
[12:44:30] No no!
[12:44:32] Not at all :)
[12:47:30] I think I forgot that the service itself doesn't start on boot. In my mind it was only the replication threads that don't start automatically. Then I couldn't see the systemd instantiated aliases, probably just fat fingers.
[12:48:25] btullis: there is config to do so, but I think it is only like that on cloud; on production, data > availability
[12:50:06] Gotcha. Many thanks both. I feel slightly foolish, but if that's the worst of it that's a great outcome :-)
[12:50:43] not foolish at all, I believe services not starting by default is an unexpected result, but it's done on purpose
[12:51:22] especially unexpected if you weren't part of the discussion of why it is like that by default AND didn't consciously change the config yourself
[12:53:52] btullis: yeah, we don't start mariadb on boot on purpose, it is better that way, so in case of crashes or unexpected reboots we don't get possibly corrupted data going unnoticed in production
[12:55:05] +1 thanks. I've got a few mariadb-related tasks to do in the next few months, so hopefully the muscle memory will get a bit stronger too.
[12:55:12] I guess one thing that could be done is to modify the existing reboot or create a new one specific for mariadb hosts
[12:55:23] *script
[12:55:37] I believe there was some WIP script already
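A rough sketch of the kind of post-reboot sequence being discussed, not the actual cookbook or WIP script; the instance names are examples taken from the db1108 conversation above, and the commands mirror the ones pasted there:

    # Bring each mariadb instance back up after a reboot (they intentionally
    # do not start on boot), then restart replication on each one.
    for inst in analytics_meta matomo; do
        systemctl start "mariadb@${inst}"
        mysql -S "/run/mysqld/mysqld.${inst}.sock" -e "START SLAVE"
    done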
[14:50:11] What's the proxy patch, https://gerrit.wikimedia.org/r/c/operations/puppet/+/886904?
[14:50:20] jynus: this is it https://gerrit.wikimedia.org/r/c/operations/puppet/+/886904
[14:50:31] looking
[14:51:12] nitpick: move*D*
[14:54:38] One thing I am seeing is binlog_format | MIXED on the new primary, so you will have to switch it and close the binlog file manually
[14:54:59] (I know you know this, just making sure it was in your mind)
[14:55:19] yeah, it is fine
[14:57:24] checked IPs, port, hosts, etc.
[14:57:32] in fact I am going to change it now that it is just a replica
[14:58:31] doing it directly is ok, just me being ultra pedantic on reviews :-D
[14:58:42] hehe I know, that's good
[16:31:23] ugh, I have forgotten all the SQL I once knew
[16:31:40] I never knew any
[16:33:02] ;p
[16:37:49] WITH cd AS (SELECT * FROM codfw.object WHERE deleted == 1), ed AS (SELECT * FROM eqiad.object WHERE deleted == 1) SELECT cd.name FROM cd LEFT JOIN ed USING(name) WHERE ed.name IS NULL;
[16:38:00] ^-- feels gross
[16:38:19] [sqlite]
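One possible tidier formulation of the query above is a set difference with EXCEPT instead of a LEFT JOIN anti-join; it assumes the same attached codfw/eqiad databases (the database file names below are hypothetical) and that losing duplicate names is fine, since EXCEPT de-duplicates.

    # The main database and ATTACH paths are hypothetical stand-ins for the
    # files used in the conversation above.
    sqlite3 objects.db "ATTACH 'codfw.db' AS codfw;
    ATTACH 'eqiad.db' AS eqiad;
    SELECT name FROM codfw.object WHERE deleted = 1
    EXCEPT
    SELECT name FROM eqiad.object WHERE deleted = 1;"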