[01:10:26] PROBLEM - MariaDB sustained replica lag on m1 on db2132 is CRITICAL: 13.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[01:12:20] RECOVERY - MariaDB sustained replica lag on m1 on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[06:15:05] jynus: when is it normally a good day to do this? https://phabricator.wikimedia.org/T330861
[07:34:53] when I pause backups, which I think I will do soon
[07:35:26] cool
[07:38:30] only 2821 files left to get up to date
[08:02:34] marostegui: I am running an additional backup and it will be all yours when it finishes
[08:02:48] jynus: ETA more or less?
[08:02:58] 20-30 minutes
[08:03:04] great
[08:03:50] meanwhile, I will start codfw backups
[08:03:55] ok
[08:04:05] (I believe that is upgraded already)
[08:04:10] yes that's done
[08:26:56] dump.backup1-eqiad.2023-03-24--08-01-56 completed
[08:28:36] 119 million successfully backed-up files
[08:31:24] nice
[08:32:21] ah yes, that should unblock you to start deleting files on eqiad :-D
[08:32:25] too
[08:34:00] the milestone may not sound that impressive, but consider that I am tracking the history of each blob as it moves and rehashing them with sha256, so I have a parallel metadata database
[08:36:11] the 100K failed backups figure is almost static, meaning it is most likely an existing consistency issue and not a problem with my backup
[09:53:09] jynus: so I can proceed?
[09:53:16] indeed
[09:53:59] Emperor: can I ramp up codfw backup load?
[10:00:02] jynus: host back up
[10:01:00] I wonder how to handle that when backups are fully autonomous? Shutting down the service? Primary switchover?
[10:01:40] I will have a similar question for binlog backups & primary switchovers, and I may need some brainstorming there
[10:02:41] probably the best thing would be to make the service detect the issue and wait before retrying
[10:05:08] jynus: ramp> sure; keep a gentle eye on the swift dashboard?
[10:05:34] Emperor: that's a given, always
[10:05:48] it has a couple of new panels now :)
[10:06:02] if backups impact production, they are not good backups
[10:06:15] but I'm mentioning it in case there was some maintenance or something
[10:07:58] jynus: probably a switchover, and treat them like any other service
[10:09:33] yeah, the issue is that unlike most other services, it is not user-request based, it is continuous (think job queue rather than http requests). So like the job queue, it needs some extra thinking.
[10:10:50] in the case of binlogs, the idea is to make sure we always consume them from the primary server, but that could add overhead on switchover
[10:12:59] maintenance> looks like we've lost a couple of disks again :( but that shouldn't impact performance
[10:19:05] Yeah, I saw that
[10:20:43] it's weird we keep losing 2 at once
[11:00:17] (SessionStoreOnNonDedicatedHost) resolved: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[12:04:53] T332983 opened and marked as high priority
[12:04:54] T332983: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983
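
Below is a minimal, illustrative sketch of the approach mentioned at 08:34 (rehashing every backed-up blob with SHA-256 and recording its history in a separate, parallel metadata database). This is not the actual Wikimedia media-backups tooling: the sqlite3 file, the backed_up_files table, and all names are assumptions made for the example; the real system keeps its tracking metadata elsewhere.

```python
#!/usr/bin/env python3
"""Sketch: rehash backed-up blobs with SHA-256 and track them in a
parallel metadata database. All schema and path names are hypothetical."""

import hashlib
import sqlite3
from pathlib import Path

# Hypothetical tracking schema: one row per (path, hash) version of a blob.
SCHEMA = """
CREATE TABLE IF NOT EXISTS backed_up_files (
    path      TEXT NOT NULL,
    size      INTEGER NOT NULL,
    sha256    TEXT NOT NULL,
    backed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (path, sha256)
);
"""


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large blobs don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_backup(db: sqlite3.Connection, path: Path) -> None:
    """Rehash one blob and upsert its metadata row."""
    db.execute(
        "INSERT OR REPLACE INTO backed_up_files (path, size, sha256) "
        "VALUES (?, ?, ?)",
        (str(path), path.stat().st_size, sha256_of(path)),
    )


if __name__ == "__main__":
    conn = sqlite3.connect("media_backup_metadata.db")  # assumed location
    conn.executescript(SCHEMA)
    for blob in Path("backup_staging").rglob("*"):  # assumed staging dir
        if blob.is_file():
            record_backup(conn, blob)
    conn.commit()
    conn.close()
```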
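
And a minimal sketch, under the same caveats, of the "make the service detect the issue and wait before retrying" idea from 10:02: an autonomous binlog/backup consumer that checks whether its target host is still the writable primary and backs off after a switchover or while the host is down. The read_only check, the 60-second backoff, and every name here are illustrative assumptions, not the real service.

```python
#!/usr/bin/env python3
"""Sketch: autonomous consumer that pauses and retries when its target
host stops being the writable primary (e.g. after a switchover)."""

import time

import pymysql

BACKOFF_SECONDS = 60  # assumed retry interval, not a real config value


def is_writable_primary(host: str, user: str, password: str) -> bool:
    """Return True if the host currently has read_only = OFF.

    Assumes the usual convention that replicas run with read_only = ON,
    so a switchover flips this flag on the old primary.
    """
    conn = pymysql.connect(host=host, user=user, password=password)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT @@GLOBAL.read_only")
            (read_only,) = cur.fetchone()
        return int(read_only) == 0
    finally:
        conn.close()


def consume_binlogs_once(host: str) -> None:
    """Placeholder for one round of binlog streaming / backup work."""
    print(f"consuming binlogs from {host} ...")


def run_forever(host: str, user: str, password: str) -> None:
    """Do work while the host is the primary; otherwise wait and retry."""
    while True:
        try:
            if is_writable_primary(host, user, password):
                consume_binlogs_once(host)
            else:
                # Switchover detected: pause instead of failing hard.
                print(f"{host} is no longer the primary; "
                      f"retrying in {BACKOFF_SECONDS}s")
                time.sleep(BACKOFF_SECONDS)
        except pymysql.MySQLError as exc:
            # Host down (e.g. being upgraded): also wait and retry.
            print(f"connection error ({exc}); retrying in {BACKOFF_SECONDS}s")
            time.sleep(BACKOFF_SECONDS)


if __name__ == "__main__":
    run_forever("db1001.example.org", "backup", "secret")  # placeholder host/credentials
```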