[06:49:14] XioNoX: hope you don't mind but I did 2 grammar changes on last night's IR
[06:49:51] RhinosF1: of course! don't hesitate
[06:50:15] XioNoX: np, thought I'd tell you
[06:50:27] I guess my brain is in picky mode today
[12:13:59] does anyone else end up writing small adhoc scripts on various hosts? e.g. cumin hosts, deneb, mwmaint etc
[12:14:27] i had a few scripts on mwmaint machines that i haven't used in a few months; went looking for them yesterday to discover that both machines have been reimaged in the meantime 😿
[12:14:59] o/ I do. I keep wondering if there is a better way.
[12:21:58] btullis: exactly my question too :)
[12:33:49] kormat: they should be in bacula but (not on wikimedia as I don't even have shell) I do all the time for like most of my projects
[12:34:41] I'd imagine backing up somewhere yourself would be good
[12:34:46] I have to go lesson though
[12:52:39] kormat: I might try this editor extension for writing scripts locally and then syncing them to where they need to be run: https://github.com/thisboyiscrazy/vscode-rsync
[12:53:08] mm, ack
[12:53:47] i think i'm going to go write a cronjob on my laptop to run https://restic.net/ against a list of hostnames that have custom stuffs on them
[12:54:17] the stuff i had on mwmaint* wasn't critical. if i lost what's on cumin1001, i'd be a _very_ sad panda
[12:54:53] (mwmaint1002 was reimaged juuust under 3 months ago, but it looks like the backup from then has expired)
[13:01:47] kormat: are these things likely to be useful to more-than-just-you? We could have a team "hacks and tools and shell scripts $DEITY should never look at" repo...
[13:02:10] * Emperor had "misc" at their last place for such things
[13:02:12] Emperor: things that reach the stage of being useful to more than just me do get migrated to such places :)
[13:06:32] Restic looks nice. I'm trying to think about it in terms of version control for my scripts as well though.
To cut down on just editing scripts on the server and running them. I used to use this tool in another life, but it might be a bit tricky here: https://packages.debian.org/bullseye/myrepos
[13:19:26] oh, doh. it's been a couple of years since i used restic, so i forgot that its model is not suitable for this. (it uses a push model; i need pull)
[13:20:46] make yourself a gitlab repo, stash things there and clone/pull at will?
[13:21:08] Emperor: that's a Lot of overhead for this use-case
[13:21:46] the cycle goes from 'edit script; run script; repeat' to 'edit script; commit to git; push to repo; pull from repo on remote machine; run script; repeat'
[13:22:56] (can't have credentials on prod machines to allow pushing to git from there)
[13:23:30] Mmm, point
[13:23:42] our "misc" repo is operations/software, although there was no automated deployment for it
[13:23:59] but I have it cloned through https on cumin, which is where I used to run most from
[13:24:03] jynus: and would be worse again than the above cycle, as it now involves gerrit ;)
[13:24:09] ha ha
[13:24:09] jynus: yeah i do the same
[13:24:32] rsnapshot, then.
[13:24:35] still git (even gerrit!) > not backed up anywhere :-)
[13:25:00] I think there is like hosts where there is a guarantee to migrate home
[13:25:11] like people and others
[13:25:24] but I guess mwmaint was not one- but weird that it is backed up
[13:25:25] I was about to suggest rsnapshot too. You could use borg with sftp? https://borgbackup.readthedocs.io/en/stable/deployment/pull-backup.html
[13:25:28] jynus: i'm rather surprised that wasn't done as part of the reimaging of mwmaint* tbh
[13:25:56] so we have bacula for backups
[13:26:17] no need for a new thing, but it is frowned upon to have things not puppetized, I think
[13:26:57] jynus: bacula is great for prod stuff.
it's not so great for this sort of thing, though
[13:27:04] another option would be puppet/admin/home, and it is deployed everywhere
[13:27:28] I think it will depend on the use case
[13:27:34] jynus: that's taking all the git+gerrit issues, _plus_ making machine-specific stuff be on all machines
[13:28:14] we could maybe hack that?
[13:28:31] jynus: Agreed, I wasn't suggesting installing any new tool on the servers to support personal backups. Only new tools on workstations.
[13:28:36] jynus: you can't hack around the huge overhead of git, even ignoring gerrit
[13:29:19] so maybe you are thinking something in the "style" of NFS, a shared space?
[13:29:44] jynus: i'm thinking of writing a shell script to run rsnapshot from my laptop against a list of hosts and directories
[13:29:57] so that i have a local, sync'd copy of stuff
[13:30:00] that would work, too
[13:31:47] I think something that could be done better is clarify which hosts will not save home on reimage and which ones will
[13:32:16] that would be useful, yeah. mwmaint hosts say in the motd that /home is backed up, which is nice but..
[13:57:50] alright, rsnapshot setup. just need to remember to add various hosts to its targets over time
[14:04:31] I might be over cautious on this topic but IMHO anything that is not a one-shot emergency script should go through our standard pipeline and ideally reviewed. (and for the emergency too a 2nd pair of eyes might save us from making things worse fwiw)
[14:05:10] we could and probably should simplify the "standard pipeline" for this use case though, on that I totally agree
[14:05:56] volans|off: here's an example: https://phabricator.wikimedia.org/P17445
[14:06:43] <_joe_> kormat: are you telling me you run bash commands in production without a proper code review? That's unheard of!
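A minimal sketch of the pull-style backup idea discussed above (a laptop-side script run against a list of hosts and directories). It uses plain `rsync` over SSH rather than rsnapshot, so it keeps a single synced copy and no rotated snapshots. The host name, directory names, and destination path are hypothetical placeholders, not values from the conversation.

```python
import shlex
import subprocess
from pathlib import Path

# Hypothetical targets: host -> directories relative to the remote home dir.
TARGETS = {
    "host1.example.org": ["scripts"],
}


def build_rsync_command(host, remote_dir, dest_root):
    """Return the rsync argv that pulls remote_dir from host into dest_root/host/."""
    remote_dir = remote_dir.rstrip("/")
    dest = Path(dest_root) / host / Path(remote_dir).name
    return [
        "rsync",
        "-a",  # archive mode: recursive, preserves permissions and timestamps
        "-z",  # compress data in transit
        "-e", "ssh",  # pull over SSH; nothing extra is installed on the server
        f"{host}:{remote_dir}/",  # trailing slash: copy the directory's contents
        f"{dest}/",
    ]


def pull_all(targets, dest_root, dry_run=True):
    """Build one rsync command per (host, directory); run them unless dry_run."""
    commands = []
    for host, remote_dirs in targets.items():
        for remote_dir in remote_dirs:
            cmd = build_rsync_command(host, remote_dir, dest_root)
            commands.append(cmd)
            if not dry_run:
                # rsync creates the final directory but not its parents.
                Path(cmd[-1]).parent.mkdir(parents=True, exist_ok=True)
                subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run only: print what would be executed.
    for command in pull_all(TARGETS, "/tmp/host-backups", dry_run=True):
        print(shlex.join(command))
```

Because the laptop initiates every transfer, no credentials need to live on the production hosts, which is the property the push-based tools lacked for this use case.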
[14:08:06] <_joe_> so yeah I think that's overcautious, but also, we have a space in puppet (of which I'm not a fan) for saving personal scripts you want to persist
[14:08:22] _joe_: already discussed above
[14:08:38] <_joe_> and I wouldn't say that the fact the maint servers were reimaged and that old home files were in backups was properly communicated
[14:08:46] <_joe_> *wasn't properly
[14:09:16] <_joe_> maybe we need to stop expiring backups of home dirs for servers and retain them for say 1 year after the server is dismissed?
[14:09:28] <_joe_> kormat: I'm not saying you shouldn't backup your stuff if you want
[14:09:37] <_joe_> I'm trying to see how we could avoid you needing to
[14:09:51] yeah, I was trying to see the same
[14:10:40] _joe_: trying to parse that line - are you saying you feel that nuking /home on mwmaint* was properly communicated?
[14:10:59] <_joe_> kormat: yes, I was very aware of it at least
[14:11:18] _joe_: can you point me to such a communication?
[14:12:03] <_joe_> I remember reading it here actually
[14:12:26] people sometimes can be on week-long vacations, or as I attended a deployer, only accessing those hosts very infrequently, I think it could be clearer
[14:12:42] <_joe_> and yes
[14:12:48] mentioning things on irc is very insufficient in terms of communication
[14:12:49] motd or a proper policy
[14:13:01] i do see a mail from mutante from _february_ saying the homedirs would be wiped
[14:13:04] <_joe_> the real point here is that we have a short retainer on stuff that wasn't repurposed
[14:13:10] <_joe_> yes, I just found it
[14:13:17] (which happened in july for mwmaint1002, and oct for mwmaint2002)
[14:14:32] <_joe_> so, I guess you're missing files that were on mwmaint1002, right?
[14:15:26] _joe_: at this point it's moot. as it happens i had copied the above script to mwmaint2002 while codfw was primary.
as that was reimaged much more recently, its backups were still available, and jynus kindly recovered it for me
[14:15:47] (which, frankly, is far too much energy expended given the fairly simple nature of the script, but jynus was very kind <3)
[14:15:54] basically my concern is (as data recovery people) is people should have clear expectations. e.g. if a host is backed up, it is also migrated (this is just an example, doesn't have to be exactly like that)
[14:16:17] this is not the first time confusion happened, I attended recently a similar case
[14:16:36] as mentioned above, if my homedir on cumin1001 went away, i'd be very very sad
[14:16:39] so I am worried about it
[14:16:42] (but now i have backups, so i'm fine again)
[14:19:55] <_joe_> so one reason why sometimes homedirs are not restored is that people don't clean up after themselves
[14:20:08] <_joe_> so they litter homedirs with gigs of stuff they'll never touch again
[14:20:20] <_joe_> I'm a first-class offender with that btw
[14:20:38] <_joe_> so I get why every N years we'd like to start up clean
[14:20:59] <_joe_> this is specifically bad on deploy* and mwmaint* hosts, btw
[14:21:44] <_joe_> I remember once I found that the then-deployment server, tin, had something like 90GB of space used in its /home, and after asking people to save what they needed, we went down to less than 5 GB
[14:22:03] <_joe_> but that took me a couple weeks, which isn't sustainable typically
[14:22:46] that is certainly a reasonable thing to do, but as the person that gets the pings of deployers later with "my files disappeared", I think we could do better, be a bit flexible in some way or super clear
[14:23:24] 90GB are nothing on backup storage- and yes, probably that is why it is backed up :-)
[14:23:51] _joe_: the burden on people who do have stuff they want to keep is pretty steep, though.
either they have to take backups (like i'm now doing), or basically file a ticket to get jynus to kindly restore the stuff after
[14:23:56] <_joe_> jynus: but having to rsync it to another host, then copy it back, like we used to do, is time-consuming and shouldn't be needed in production.
[14:24:03] because 1 notice in february for dataloss that happens in july is not cutting it
[14:24:53] I am trying to find a compromise, if possible here, that would make most people happy :-)
[14:25:53] I am more than happy to restore things- (although that has limited period of being effective- 2-3 months)
[14:26:38] <_joe_> (btw, anyone with root is able to restore files)
[14:27:00] _joe_: have you seen the procedure for doing it after a host has been reimaged? :P
[14:27:15] I was saying it as that is not an issue for me, I am worried about the impact on users
[14:27:28] <_joe_> kormat: I've done it multiple times to restore stuff, so unless it's changed radically, yes
[14:27:35] we could increase backup retention for homes, we could have some periods of announcements or something else
[14:27:44] <_joe_> and I've done it for stuff of mine that was wiped in a reimage, too
[14:27:59] _joe_: and do you not consider it to be burdensome?
[14:28:11] it's great that the possibility is there, to be sure
[14:28:49] <_joe_> kormat: sure it is
[14:29:43] <_joe_> but wait, I assumed jynus wrote magic scripts that make it super easy
[14:29:50] not yet :-)
[14:29:53] I plan to
[14:29:57] _joe_: https://wikitech.wikimedia.org/wiki/Bacula#Restore_from_a_non-existent_host_(missing_private_key)
[14:31:40] to be fair, it is better than it looks, I only had to do 4 commands, the rest are edge cases people have been "completing the docs"
[14:32:18] <_joe_> jynus: probably the "happy path" should be more clearly outlined then :D
[14:34:16] but again, that won't work for someone that realizes too late
[14:34:51] and yes, puppetize and do personal backups, but we can aspire to be more friendly 0:-)
[14:36:09] for example- let me know what you think- "if you send a notice of a reimage, let at least e.g. 30 days to pass before announcement, if 60 pass without it doing it, repeat the announcement"
[14:36:20] or "add it to motd"
[14:37:07] (which wipes /home, it is implied)
[14:37:50] and if it cannot wait, I have 1TB reserved for long-term archivals which can help
[14:38:12] like we did with mailman
[14:49:44] the "happy path" should be more clearly outlined -> Done
[17:54:32] related to the mwmaint reimage problem is https://phabricator.wikimedia.org/T287303
[17:55:30] it's theoretically straightforward to puppetize your home dir in the admin module, but I too am guilty of not doing that
[20:56:48] legoktm: it's actually not so straightforward as folks don't like too much in those auto provisioned files (at least when I set up mine I was asked to trim them down a lot) and the files end up losing all exec bits.
[21:16:24] * legoktm nods
[21:16:29] hence "theoretically" :/
[21:18:03] Putting all adhoc scripts into ops/puppet is probably not going to work very well either. We don't have the human infrastructure for a change control board type process for all interactions with production data.
[21:26:22] all I can say about this whole topic is .. for many years I have been copying the home dir data around during migrations and people said it's too much effort and people shouldn't expect those to be permanent or shells for adhoc work.. and the file sizes kept building up and nobody ever deleted stuff.. then one time kind of as a test I don't copy them (after confirming Bacula option exists) and the
[21:26:28] result is requests for restore and this entire thread. so the test just showed to me to just go back to rsyncing files with a temp. puppet class (then people say it's overkill too but it's not really compared to restore and the discussion )
[21:52:24] the exec bit thing is a non-issue, you can add a line to .bashrc to fix that (https://github.com/wikimedia/puppet/blob/production/modules/admin/files/home/ori/.bash_profile#L3-L6)
[21:58:32] ori: or the clone could be fixed to stop stripping exec bits
[21:59:01] N people could hack around the bug or 1 person could fix it
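For illustration, here is a minimal sketch of the per-user workaround for the stripped exec bits mentioned above. It is not a copy of the linked `.bash_profile`. It assumes that any regular file beginning with a `#!` line is meant to be executable, and the function name and the directory passed to it are hypothetical.

```python
import stat
from pathlib import Path


def restore_exec_bits(directory):
    """Add the owner-execute bit to files in directory that start with '#!'.

    Returns the list of paths whose mode was changed.
    """
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories and special files
        with path.open("rb") as handle:
            if handle.read(2) != b"#!":
                continue  # no shebang: treat as a plain data file
        mode = path.stat().st_mode
        if not mode & stat.S_IXUSR:
            path.chmod(mode | stat.S_IXUSR)
            changed.append(path)
    return changed
```

Something like this could be called at login for a provisioned scripts directory, though as the last messages point out, fixing the provisioning step once would remove the need for everyone to carry such a workaround.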