[01:10:26] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 13.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:12:20] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[06:15:05] jynus: when is it normally a good day to do this? https://phabricator.wikimedia.org/T330861
[07:34:53] when I pause backups, which I think I will do soon
[07:35:26] cool
[07:38:30] only 2821 files left to get up to date
[08:02:34] marostegui: I am running an additional backup and it will be all yours when it finishes
[08:02:48] jynus: ETA more or less?
[08:02:58] 20-30 minutes
[08:03:04] great
[08:03:50] meanwhile, I will start codfw backups
[08:03:55] ok
[08:04:05] (I believe that is upgraded already)
[08:04:10] yes that's done
[08:26:56] dump.backup1-eqiad.2023-03-24--08-01-56 completed
[08:28:36] 119 million successfully backed-up files
[08:31:24] nice
[08:32:21] ah yes, that should unblock you to start deleting files on eqiad :-D
[08:32:25] too
[08:34:00] the milestone may not sound that impressive, but consider that I am tracking the history of each blob as it moves and rehashing them with sha256, so I have a parallel metadata database
[08:36:11] the 100K failed backups figure is almost static, meaning it is most likely an existing consistency issue and not a problem with my backup
[09:53:09] jynus: so I can proceed?
[09:53:16] indeed
[09:53:59] Emperor: can I ramp up codfw backup load?
[10:00:02] jynus: host back up
[10:01:00] I wonder how to handle that when backups are fully autonomous? Shutting down the service? Primary switchover?
[10:01:40] I will have a similar question for binlog backups & primary switchovers, and I may need some brainstorming there
[10:02:41] probably the best thing would be to make the service detect the issue and wait before retrying
[10:05:08] jynus: ramp> sure; keep a gentle eye on the swift dashboard?
[10:05:34] Emperor: that's a given, always
[10:05:48] it has a couple of new panels now :)
[10:06:02] if backups impact production, they are not good backups
[10:06:15] but I'm mentioning it in case there was some maintenance or something
[10:07:58] jynus: probably a switchover, and treat them like any other service
[10:09:33] yeah, the issue is that unlike most other services, it is not user-request based, it is continuous (think job queue rather than http requests). So like the job queue, it needs some extra thinking.
[10:10:50] in the case of binlogs, the idea is to make sure we always consume them from the primary server, but that could add overhead on switchover
[10:12:59] maintenance> looks like we've lost a couple of disks again :( but that shouldn't impact performance
[10:19:05] Yeah, I saw that
[10:20:43] it's weird we keep losing 2 at once
[11:00:17] (SessionStoreOnNonDedicatedHost) resolved: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[12:04:53] T332983 opened and marked as high priority
[12:04:54] T332983: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983
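
Below is a minimal, illustrative sketch of the approach mentioned at 08:34 (rehashing every backed-up blob with SHA-256 and recording its history in a separate, parallel metadata database). This is not the actual Wikimedia media-backups tooling: the sqlite3 file, the backed_up_files table, and all names are assumptions made for the example; the real system keeps its tracking metadata elsewhere.

```python
#!/usr/bin/env python3
"""Sketch: rehash backed-up blobs with SHA-256 and track them in a
parallel metadata database. All schema and path names are hypothetical."""

import hashlib
import sqlite3
from pathlib import Path

# Hypothetical tracking schema: one row per (path, hash) version of a blob.
SCHEMA = """
CREATE TABLE IF NOT EXISTS backed_up_files (
    path      TEXT NOT NULL,
    size      INTEGER NOT NULL,
    sha256    TEXT NOT NULL,
    backed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (path, sha256)
);
"""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large blobs don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_backup(db: sqlite3.Connection, path: Path) -> None:
    """Rehash one blob and upsert its metadata row."""
    db.execute(
        "INSERT OR REPLACE INTO backed_up_files (path, size, sha256) "
        "VALUES (?, ?, ?)",
        (str(path), path.stat().st_size, sha256_of(path)),
    )


if __name__ == "__main__":
    conn = sqlite3.connect("media_backup_metadata.db")  # assumed location
    conn.executescript(SCHEMA)
    for blob in Path("backup_staging").rglob("*"):  # assumed staging dir
        if blob.is_file():
            record_backup(conn, blob)
    conn.commit()
    conn.close()
```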
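
And a minimal sketch, under the same caveats, of the "make the service detect the issue and wait before retrying" idea from 10:02: an autonomous binlog/backup consumer that checks whether its target host is still the writable primary and backs off after a switchover or while the host is down. The read_only check, the 60-second backoff, and every name here are illustrative assumptions, not the real service.

```python
#!/usr/bin/env python3
"""Sketch: autonomous consumer that pauses and retries when its target
host stops being the writable primary (e.g. after a switchover)."""

import time

import pymysql

BACKOFF_SECONDS = 60  # assumed retry interval, not a real config value


def is_writable_primary(host: str, user: str, password: str) -> bool:
    """Return True if the host currently has read_only = OFF.

    Assumes the usual convention that replicas run with read_only = ON,
    so a switchover flips this flag on the old primary.
    """
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@GLOBAL.read_only")
            (read_only,) = cur.fetchone()
        return int(read_only) == 0
    finally:
        conn.close()


def consume_binlogs_once(host: str) -> None:
    """Placeholder for one round of binlog streaming / backup work."""
    print(f"consuming binlogs from {host} ...")


def run_forever(host: str, user: str, password: str) -> None:
    """Do work while the host is the primary; otherwise wait and retry."""
    while True:
        try:
            if is_writable_primary(host, user, password):
                consume_binlogs_once(host)
            else:
                # Switchover detected: pause instead of failing hard.
                print(f"{host} is no longer the primary; "
                      f"retrying in {BACKOFF_SECONDS}s")
                time.sleep(BACKOFF_SECONDS)
        except pymysql.MySQLError as exc:
            # Host down (e.g. being upgraded): also wait and retry.
            print(f"connection error ({exc}); retrying in {BACKOFF_SECONDS}s")
            time.sleep(BACKOFF_SECONDS)


if __name__ == "__main__":
    run_forever("db1001.example.org", "backup", "secret")  # placeholder host/credentials
```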