[01:06:41] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:06:47] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 4.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[01:07:57] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:08:03] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[06:32:53] heh, I am glad I am decommissioning db2091...while doing a transfer I just got memory errors
[11:03:03] marostegui: three replicas in s7 have a different index on the revision table, on frwiktionary only
[11:03:12] why :((
[11:10:42] :(
[11:10:50] if you create a task I can get it done next week
[11:11:02] nah, I'll do it soon
[11:11:15] probably on monday
[11:11:27] remember to take cumin1001 reboot into account
[11:11:36] yeah
[11:11:47] I will usually wake up way later than that :P
[11:15:02] or go to bed
[11:17:40] haha
[11:49:56] so monday is a US holiday, and Eric will be out
[11:49:59] shall we move the meeting again?
[12:39:47] Fine by me
[12:40:29] jynus: I was thinking....given that deleting a file from a backup is a very rare operation, should we also email ourselves (or someone) once a delete has been made? (not a dry-run)
[12:40:33] Just for awareness
[12:59:57] haven't thought too much about procedure
[13:00:10] as that is a separate thing from automation
[13:00:28] ideally we don't have to do anything on that, T&S does
[13:02:33] sure
[13:06:11] right, but arguably that's also partially a security thing?
[13:06:24] so then it makes sense to have a technical solution for that rather than only relying on procedure
[13:12:56] sorry, I didn't fully get what you meant- please be gentle with me, brain still not at 100%
[13:16:10] i mean, having automatic emails being sent when someone deletes a file seems like a good thing for security reasons as well
[13:16:33] i think your point may be, then T&S should request that feature, I guess, but I would recommend just adding it regardless :)
[13:17:26] oh, no, what I meant is I acknowledge the need for a proper procedure, just that it is a problem for future Jaime
[13:18:00] past Jaime's work focused on the technical parts "make it easy"
[13:18:07] alright
[13:18:25] so what I mean is, while purely brainstorming
[13:18:35] got it
[13:18:50] Ideally we don't have to do anything, T&S has a script that they can run on their own
[13:19:29] and you meant that there would be a case for us being like a security layer, which I would agree, right?
[13:19:44] yes
[13:19:44] e.g. to protect from T&S accidental deletion, right?
[13:19:49] ok, then I got you
[13:19:53] yeah
[13:20:03] or even security breaches
[13:20:09] yeah
[13:21:04] so that would be Q1 minigoal, the non technical side of the solution
[13:22:23] marostegui: the thing about emails is that I don't like them as part of automation
[13:22:41] although a script could send one
[13:23:47] jynus: yeah, I mean just as a way to be aware that something has been deleted
[13:23:59] not sure, needs a think- security and ease of use are at odds here
[13:24:00] It is not like we are going to get 1 per day, which we'd probably end up ignoring
[13:24:17] my question is- is that not what you and I do already :-D
[13:27:31] I am thinking like a step beyond- like T&S deletion script could write a log, which informs a pending operation; that way it is a one button action and cannot be forgotten, or something like that
[13:29:10] I would recommend keeping it simple at first, plenty of time later to revise that should the need arise (or it may not) :)
[13:30:19] yes, but there are some basic questions- who should receive alerting about this, ideally? (owning) Me? Data persistence? All SREs?
[13:31:08] I would try with data-persistence and then we can tweak as needed probably
[13:32:14] Adding an additional action to log erasures to a db would be quite easy: https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/eraseArchivedFile.php
[13:33:40] and what measures do we put in place to make sure a request is legitimate and not accidental?
[13:33:58] e.g. is deleting 10 files weird? 1000? 1000000?
[13:37:39] so to summarize, for now marostegui you are 100% right that we should send an email to the requester and other SREs acking/confirming deletions
[13:40:12] and on the next request we should do -if we can- a session to validate script/training for other members of the team
[15:10:46] (PrometheusMysqldExporterFailed) resolved: Prometheus-mysqld-exporter failed (an-coord1001:9104) - https://grafana.wikimedia.org/d/000000278/mysql-aggregated - https://alerts.wikimedia.org/?q=alertname%3DPrometheusMysqldExporterFailed
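A minimal sketch of the notification idea discussed above (12:40 to 13:40): build an email only when a deletion is real rather than a dry run, and flag unusually large requests for review. Everything named here is an assumption for illustration and not part of any existing script: the DeletionRequest type, the build_notification helper, the example.org address and the threshold of 10 files (the log leaves open what count should be considered weird).

```python
from dataclasses import dataclass
from email.message import EmailMessage
from typing import List, Optional

# Hypothetical values: neither the address nor the threshold comes from the log.
NOTIFY_ADDRESS = "data-persistence@example.org"
REVIEW_THRESHOLD = 10


@dataclass
class DeletionRequest:
    """Hypothetical record of one backup deletion request."""
    requester: str
    files: List[str]
    dry_run: bool = True


def build_notification(request: DeletionRequest) -> Optional[EmailMessage]:
    """Return an email describing a real deletion, or None for a dry run."""
    if request.dry_run:
        return None
    count = len(request.files)
    # Requests above the (assumed) threshold are marked for human review.
    flag = " [NEEDS REVIEW]" if count > REVIEW_THRESHOLD else ""
    msg = EmailMessage()
    msg["To"] = NOTIFY_ADDRESS
    msg["Subject"] = f"Backup deletion by {request.requester}: {count} file(s){flag}"
    msg.set_content("\n".join(request.files))
    return msg
```

The sketch only builds the message; actually sending it (for example with smtplib) and choosing the recipients are left out, since the discussion had not settled who should be alerted.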