[06:34:17] [07:34:03] <+icinga-wm> PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: database-backups-snapshots.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:24] Looks like I am going to spend another day with this!
[06:34:56] Looks related to x1
[06:37:28] So the error log says error, but the backups table says it finished successfully
[06:53:36] So, one log says it finished correctly, another one says it failed, and the DB says it finished correctly: https://phabricator.wikimedia.org/P17680
[06:54:01] The backup looks fine too on dbprov1002's path, so I am going to reset the failed unit as I believe the snapshot was actually successful
[08:38:30] I guess that's better than the other way round...
[08:49:24] marostegui: I think that s3 failed the transfer
[08:49:56] [05:22:00]: INFO - Executing commands [cumin.transports.Command('/bin/mkdir /srv/backups/snapsh
[08:50:01] ots/ongoing/snapshot.s3.2021-11-04--05-22-00')] on '1' hosts: dbprov1002.eqiad.wmnet
[08:50:04] Nov 4 05:42:40 cumin1001 remote-backup-mariadb[24430]: [05:42:40]: ERROR - Transfer failed!
[08:50:54] the ERROR log could state what failed though...
[08:51:36] and maybe s6 too?
[08:51:57] as far as I can tell
[08:55:05] basically checking the ERROR lines above in the log and looking at the previous lines ("Running XtraBackup at" for the specific DB or "Completed command '/bin/mkdir" for the section name)
[09:08:01] marostegui: I've sent https://gerrit.wikimedia.org/r/c/operations/software/wmfbackups/+/736652
[09:17:55] volans: but that's a different error than the one I saw on dbprov1002 and cumin1001
[09:18:10] I will check again, but at the time only x1 failed
[09:19:56] volans: I'm fine with that patch (thanks!!) but let's wait for Jaime before merging
[09:23:10] sure, I was not planning on doing that ;)
[09:23:25] marostegui: the error at the end of the logs is
[09:23:26] ERROR - Backup process completed, but some backups finished with error codes
[09:23:50] but if you search upwards for ERROR you see 2 "ERROR - Transfer failed!" lines
[09:24:16] and from those it's not totally obvious which section/host they refer to, as you have to parse the previous INFO lines to extract that
[09:24:38] with the patch I think it would be safe and simpler to just grep for ERROR to get a full picture of what happened
[09:24:59] but YMMV, so feel free to drop my patch if it doesn't fit for any reason
[09:27:06] yes, but the db doesn't show any errors
[09:27:44] ok then I have no idea, sorry :)
[09:27:51] I just tackled it from the logs PoV
[09:28:25] yes, I'm pretty lost too, as the log says one thing and the db another
[09:28:33] and checking the snapshot itself it looks fine
[09:28:46] also the unit mentioned x1
[09:28:48] so no idea
[09:29:12] quantum backups
[09:29:25] they change state based on when/how you look at them
[09:29:28] :-P
[09:30:05] at least I'm seeing 2 OK and 1 fail, so the OK wins
[09:30:15] but what matters are the on-disk ones, and those seem fine
[09:34:37] you could always delete some stuff and test the backups ;-)
[09:41:36] volans: on which error log are you seeing s3?
[09:42:05] cause at 05:22 I am seeing it worked fine
[09:42:10] Nov 4 05:22:00 cumin1001 remote-backup-mariadb[24430]: [05:22:00]: INFO - 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
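
(A minimal sketch of the triage described at 08:55 and 09:23: filter the syslog down to the remote-backup-mariadb lines, then show some context before each ERROR so the preceding "Running XtraBackup at ..." / "Completed command '/bin/mkdir ..." INFO lines identify the section and host. The log path and message format are taken from the snippets quoted in this channel; the amount of context is arbitrary.)

    # on cumin1001: keep only the backup daemon's lines, then print 20 lines
    # of leading context for every ERROR so the identifying INFO lines show up
    grep 'remote-backup-mariadb' /var/log/syslog \
      | grep -B 20 'ERROR' \
      | less
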
[09:43:35] grep remote-backup-mariadb /var/log/syslog | less
[09:44:10] [05:22:00]: INFO - Running XtraBackup at db1102.eqiad.wmnet:3313 and sending it to dbprov1002.e
[09:44:13] qiad.wmnet
[09:44:15] [05:42:40]: ERROR - Transfer failed!
[09:44:59] And that one isn't marked as failed on the DB
[09:45:02] No idea what is going on
[09:45:13] Let me try to re-run it
[09:47:03] I have started s3's snapshot again
[09:48:24] marostegui: I've assumed we do them sequentially
[09:48:40] if they are in parallel the error message could refer to another backup started earlier
[09:48:50] and in that case my patch should help even more :)
[09:48:56] yeah, but why is there no entry in the DB? no clue
[09:49:05] there should be one, either on-going or failed
[09:49:08] (or finished)
[10:02:26] Going to restart mysql on pc2014
[10:09:13] I wonder if all the transfer failures are the same as https://phabricator.wikimedia.org/T262388
[11:21:34] morning :D
[13:41:23] volans: s3 backup finished fine
[13:41:38] \o/
[14:06:50] Hi, I just noticed that on our db1108 multi-instance replica for backups we have character_set_{server,filesystem,} and collation_server set to binary, but on the master and failover replica we don't set them at all, so they seem to default to latin1, etc.
[14:07:03] should we set these to binary on the source instances too?
[14:07:07] i'd assume they should be in sync.
[14:08:15] yeah, if possible they should be exactly the same
[14:08:26] ottomata: which master are you looking at?
[14:08:34] e.g. an-coord1001
[14:08:57] ohh, sorry. didn't realise it was DEng stuff
[14:09:16] i'm going to recreate the analytics_meta instance on db1108 for https://phabricator.wikimedia.org/T279440
[14:09:36] am reading docs here https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[14:10:29] i think i want to: make sure the mariadb confs are in sync, then do a manual snapshot & recovery from either an-coord1001 or an-coord1002 (failover replica).
[14:10:37] i can't use the existing logical backups because those will be out of sync
[14:10:41] does that sound right?
[14:11:01] yes, if you can stop one of the instances that are ok and copy it, that's the fastest way
[14:11:23] make sure to stop slave first on them
[14:11:26] yup i can do that. oh, just a manual datadir copy instead of using the backup-mariadb script?
[14:11:29] and note the replication position
[14:11:49] yeah, if you can stop MySQL and do a scp, that's faster
[14:12:01] ya we have a readonly failover replica
[14:12:08] i can do it from that, right?
[14:12:31] yeah, just issue a stop slave; show slave status\G and write that down
[14:12:41] stop MySQL and then scp to db1108
[14:12:51] (make sure MySQL is stopped on db1108 too)
[14:13:01] 'scp' is a lie. you probably want transfer.py?
[14:13:14] :-)
[14:13:42] yayayay
[14:14:10] i'll write down what i'm going to do in the phab ticket and verify with yall
[14:14:20] sounds good
[14:14:46] prev question: 'binary' is preferred?
[14:14:51] or default values preferred?
[14:15:29] we prefer binary in production MW but up to you :)
[14:15:43] i have no idea, just want things to be consistent
[14:16:04] whatever your masters are running, I would say
[14:16:26] okay, masters have it not set, so using defaults
[14:16:35] oki
[14:16:56] yeah that makes sense
[14:18:18] although I'd say that latin1 is not a great default, but if whatever creates the table always specifies the collation then it doesn't matter in practice
[14:19:02] yeah, but i guess i'm more concerned with consistency between replicas atm, probably not the time to change the master settings.
[14:20:24] yes, that was my point
[14:31:23] marostegui: it looks like our masters have log_basename=analytics-meta, you were saying this is bad practice right?
[14:31:32] better not to set that, so the binlog files are named after the hostname?
[14:32:26] ottomata: yeah, we have that unset in production
[14:32:32] but it is really up to you
[14:33:34] it seems somehow simpler to not rely on the hostname, e.g. if we had binlogs named after hostnames and we did a snapshot recreation to a new host, the binlog names would be incorrect?
[14:34:13] marostegui: asking because I was wondering what to do with the binlog files on an-coord1002 (our readonly replica) from which I'm going to make a snapshot copy over to db1108
[14:34:20] ottomata: we are in a meeting now
[14:34:28] okay, answer when you have time, no hurry :)
[14:35:06] on an-coord1002 the binlogs are named e.g. analytics-meta-bin.000761, but on db1108 right now they are named db1108-bin.001003, even though db1108 is a multi-instance replica
[14:35:41] analytics-meta seems more appropriate to me, and will resolve the 'what to do with the filename' question, so unless you have a reason for me not to, i'll keep log_basename=analytics-meta
[15:22:32] I'm starting to upgrade db2143 now
[17:43:19] ottomata: sorry for the delayed review, pass done now. today got busier than expected.
[17:50:35] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 102.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[17:59:29] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[19:13:41] thanks kormat, i realized I probably have to fix the backup replica first anyway. appreciate it.
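
(For reference, a rough sketch of the copy procedure discussed at 14:11-14:13, assuming an-coord1002 as the source and the analytics_meta instance on db1108 as the target. The datadir paths, the multi-instance systemd unit name and the exact transfer.py invocation are assumptions and would need to be checked against the actual hosts and the wikitech docs, not taken from this log.)

    # on the source replica (an-coord1002): stop replication, record the
    # coordinates somewhere safe, then stop the instance before copying
    sudo mysql -e "STOP SLAVE; SHOW SLAVE STATUS\G"   # note binlog file/position
    sudo systemctl stop mariadb

    # on the target (db1108): make sure the target instance is stopped too
    sudo systemctl stop mariadb@analytics_meta        # assumed unit name

    # from a cumin host: copy the datadir with transfer.py (paths illustrative)
    transfer.py an-coord1002.eqiad.wmnet:/srv/sqldata \
      db1108.eqiad.wmnet:/srv/sqldata.analytics_meta
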