[06:47:49] Amir1: I'm starting a schema change in s3 in eqiad
[06:49:46] actually there's a warning on db1150
[06:50:41] running a backup it seems
[08:00:48] FIRING: PuppetFailure: Puppet has failed on ms-fe1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:10:48] RESOLVED: PuppetFailure: Puppet has failed on ms-fe1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[10:11:54] Last dump for db_inventory at eqiad (db1215) taken on 2025-07-29 00:42:28 is 105 KiB, but the previous one was 112 KiB, a change of -6.9 %
[10:41:59] that puppet failure was related to gitlab maintenance
[10:42:20] (I was going to ignore it since it self-resolved, but then couldn't actually leave it alone)
[10:50:29] last hours to ask me anything 🙌
[10:55:02] is there any progress on backups for gitlab, in case I'm asked about it while you're away? I think full deployment of its storage on apus is currently blocked on this question
[10:55:37] the blocker was you being available so we could discuss it
[10:56:17] as we don't want to set up something that wouldn't work for other things outside of gitlab
[10:57:10] tbf, it was also blocked on me splitting gitlab onto its own separate storage
[10:57:29] but the backup sources got priority these past weeks
[10:59:55] if you want to send me questions to think about while you're away, fire away (though maybe by email); I hadn't realised you needed more input from me
[11:00:30] yeah, basically, I will set it up, but there always has to be a compromise
[11:00:54] the gitlab maintainers set out their requirements
[11:02:29] for gitlab, maintaining an offline copy and sending it to bacula is enough, but that may not scale for other stores
[11:03:20] At least initially it'll only be small use cases, 'cos the cluster is quite small :)
[11:03:35] yeah, but that is why I wanted to talk to you
[11:04:02] what the future hypothetical needs of other users could be
[11:04:08] and if we need to prepare for those
[11:04:36] or if we should just do something easy for gitlab only, and change strategy later
[11:07:10] in general: what does a backup of an object store look like?
[11:07:28] and it is ok to say: "don't know, ignore it until we get there, just back up gitlab"
[11:07:48] I mean in terms of requirements, not implementation
[11:12:24] I think I'm tempted to say "let's not block solving the problem for gitlab on trying to solve the wider question"
[11:12:50] I will interpret that in the worst possible way: "I will do whatever I think is ok"
[11:13:47] I think if you do whatever is sensible to achieve gitlab backups, I'm content - I've previously said a couple of ways I'd think about doing so, and am happy to review any plan you come up with :)
[11:14:40] I haven't seen those. Would love to see even hypothetical or theoretical solutions. Not that I don't have some, but I am open to further ideas.
[11:14:56] as maybe I could be missing some
[11:15:47] Note this will be the first time we back up an object storage (media backups cover mediawiki, not swift)
[11:17:50] I've shared a doc with you if you want to share those ideas (ok to copy and paste)
[11:17:59] and I will have a look after I come back
[11:18:22] (I will ofc end up adding more stuff)
[11:18:44] I am more asking about things I haven't taken into account, like "read rate should be limited to 1000 reads/s"
[11:18:55] that as the service owner you may want to impose
[11:19:00] if that makes sense
[11:19:27] but any input is welcome
[11:19:48] or you may have expertise on ceph-related tooling I don't know about
[11:20:01] which are all reasons I value highly your opinion
[11:20:08] *why
[11:20:17] even if it is ultimately my thing to solve
[11:25:32] I added a comment focusing on my main questions (not for you to solve, but so you can understand my biggest questions, in case you have a good answer)
[11:27:30] having said that, gitlab is "easy" because almost anything will work
[11:27:48] I just like asking questions even if we decide not to answer them yet
[11:28:50] And then there are things where I may need an expert like you: "Can I ask for all objects changed since X timestamp efficiently?"
[11:29:16] if not, is there a way to do so by activating some log?
[11:31:37] rclone has a command option for "objects more recent than [date]"
[11:31:48] oh, I don't doubt it
[11:31:55] the question is, how fast it runs
[11:32:07] I suspect it's a metadata query, so scales with container size
[11:32:15] as if it has to read every single object to do so
[11:32:29] it won't scale to the sizes of e.g. mediabackups
[11:32:45] No, I think the bucket listing tells you what you need
[11:33:03] those are the kind of things that I may have questions about, and while I don't expect you to have all the answers
[11:33:07] i.e. you list the bucket once (which will be slow for a large bucket), and that contains the info you need
[11:33:11] you sometimes may know better
[11:33:33] yeah, that's the kind of bad scenario I feared
[11:34:51] again, later that will have a practical answer "it takes X seconds to do it" and that will either work or not
[11:35:54] anyway, please dump there everything you can think of
[11:36:16] and I will take it into account and try to see what's the best way
[11:37:07] Re: your current answer, I was talking about versioning on storage, not on source
[11:37:52] on backup storage
[11:38:44] because in the end, someone will want "give me the status of the storage at 11:38"
[11:39:02] and depending on how we store it that will be either impossible or possible but too slow
[11:39:37] and that is ok if we decide "you will only be able to get the status of it every 1 hour, or every 1 day"
[11:39:49] but it is still a decision
[11:42:11] it is more important to think about the speed (and logic) of recovery than the speed of taking backups
[11:42:14] I'm sorry, I don't understand what "status of the storage at 11:38" means
[11:42:41] I am not that worried about knowing that something changed at 11:38
[11:43:00] that can be done, faster or slower
[11:43:36] but how can I recover the 11:38 state, where some files had not yet been created, and others were modified later
[11:43:49] if I have a backup, e.g. every day
[11:44:45] "A bug started at 11:38, corrupting files, please revert gitlab to that timestamp"
[11:45:38] files uploaded after that time should not be available for security reasons
[11:45:52] can you see that is not trivial?
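To make the "all objects changed since X" question above concrete: with an S3-compatible API the usual answer is a full bucket listing filtered on each object's LastModified timestamp - a metadata-only operation (no object reads), but one that still scales with the number of keys in the bucket, which is exactly the scaling worry raised above; rclone's age filters presumably work the same way. A minimal sketch follows, assuming a hypothetical endpoint, bucket name, and credentials (none of these are the real apus configuration):

```python
# Minimal sketch (not the real setup): list objects modified since a given
# timestamp by walking the bucket listing and filtering on LastModified.
# Metadata-only (no object reads), but the listing itself still scales with
# the number of keys in the bucket.
import os
from datetime import datetime, timezone

import boto3

ENDPOINT = "https://apus.example.wmnet"  # hypothetical S3/RGW endpoint
BUCKET = "gitlab-backups"                # hypothetical bucket name
SINCE = datetime(2025, 7, 29, 11, 38, tzinfo=timezone.utc)

s3 = boto3.client(
    "s3",
    endpoint_url=ENDPOINT,
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

changed = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= SINCE:
            changed.append((obj["Key"], obj["LastModified"], obj["Size"]))

print(f"{len(changed)} objects changed since {SINCE.isoformat()}")
```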
[11:46:37] I think you are now starting to understand my doubts :-D
[11:46:56] based on your last doc comment
[11:47:16] In an old-style backup setup (like I have at home), the answer would be "I can restore from the last backup I took before that time, which was last night"
[11:47:26] yep
[11:47:41] it is just that with object storage things are more granular
[11:48:39] plus there are some more restrictions - usually we want to use deduplication as much as possible to save space
[11:50:09] bacula handles incrementals for us in terms of files. I don't know of a tool doing the same automation for object storage
[11:50:36] anyway, I don't want to steal a lot of your time, but wanted to know if you had thoughts
[11:50:59] you will have time to add stuff while I am out, and I will see what's the best way after I come back
[11:53:37] at least I made you see matrix :-D
[11:54:20] and the fact that I was able to demonstrate that backing up commons was theoretically impossible, but we still made it work, gives us hope 0:-)
[11:56:41] do you have a view on whether you'd like to back up to an object store or a filesystem?
[11:57:06] view, as in preference? nope
[11:57:06] [and do you know if bacula or minio claim to be able to usefully back up S3 buckets?]
[11:57:16] [yes, I meant preference]
[11:57:25] I have a view that we shouldn't use minio
[11:57:57] fair enough
[11:58:14] and that filesystem is simple and reliable, but only if a bunch of objects are stored together (the bacula solution)
[11:58:38] I mean, everything is a filesystem, right?
[11:59:15] so it will be like a circle of competing needs: reliability, simplicity, performance, etc.
[11:59:17] No, definitely not :)
[11:59:22] ha ha ha
[12:00:15] I have not looked at plugins
[12:00:30] but I consider those implementation details; I am still at "what's the architecture we need"
[12:00:46] and once we fix the requirements, we choose the best technology there is for it
[12:01:01] it helps, ofc, to know the available architectures when looking at tech
[12:01:50] but it is important to note that backups have a completely different set of needs than production serving (no need for high concurrency)
[12:02:51] there was this other s3 open source solution that I have yet to look at
[12:03:34] garagehq
[12:04:28] the good thing is gitlab is small enough to be able to experiment, unlike mediabackups
[12:10:48] thank you, Emperor, even just this talk was already useful
[12:11:12] and I hope at least this was informative for you in understanding the questions I face
[13:12:38] best wishes
[14:30:15] o/ can I get a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1173974 ?
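Going back to the "give me the status of the storage at 11:38" question discussed above: one way a point-in-time restore could look is sketched below, assuming the backup-side bucket has S3 object versioning enabled - that is an assumption, not something confirmed in the discussion. The endpoint, bucket name, credentials, and restore path are hypothetical, and delete markers are ignored for brevity, so keys deleted before the target time would still be restored by this sketch.

```python
# Minimal sketch (not the real setup): reconstruct the state of a bucket at a
# given timestamp, assuming the backup-side bucket has object versioning
# enabled. For each key, keep the newest version created at or before the
# target time; keys first created after it are skipped. Delete markers are
# ignored for brevity.
import os
from datetime import datetime, timezone

import boto3

ENDPOINT = "https://apus.example.wmnet"  # hypothetical S3/RGW endpoint
BUCKET = "gitlab-backups"                # hypothetical versioned bucket
RESTORE_DIR = "/srv/restore"             # hypothetical restore target
AT = datetime(2025, 7, 29, 11, 38, tzinfo=timezone.utc)

s3 = boto3.client(
    "s3",
    endpoint_url=ENDPOINT,
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

state = {}  # key -> (version_id, last_modified) as of AT
paginator = s3.get_paginator("list_object_versions")
for page in paginator.paginate(Bucket=BUCKET):
    for version in page.get("Versions", []):
        if version["LastModified"] > AT:
            continue  # created after the target time: not part of the 11:38 state
        key = version["Key"]
        if key not in state or version["LastModified"] > state[key][1]:
            state[key] = (version["VersionId"], version["LastModified"])

for key, (version_id, _) in state.items():
    dest = os.path.join(RESTORE_DIR, key)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    s3.download_file(BUCKET, key, dest, ExtraArgs={"VersionId": version_id})
```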