[07:11:33] inflatador: o/ sorry just seen the msg, I was already afk! Thanks btullis :) [08:05:05] I will restart db1117:m1 replication, as I saw no issue on librenms [08:08:50] jynus: +1 [08:12:40] thanks for the +1, XioNoX! [09:16:56] fyi: We are going to depool services in codfw soon for the upcoming pdu replacement later today [10:55:04] I'm getting a bunch of errors for puppet runs about confd, anyone changed anything there lately? (/me going to the git logs) [10:55:28] dcaro: o/ Probably me. [10:55:40] Sorry. [10:55:52] no problem at all, can I help in any way? [10:56:35] 'E: Unable to locate package confd' that's what I'm getting [10:56:47] Where are you seeing it? [10:56:59] clouddb1016.eqiad.wmnet (and several others) [10:57:08] https://alerts.wikimedia.org/?q=team%3Dwmcs&q=alertname%3Dpuppet%20last%20run [10:57:28] not only cloud stuff though (just unfilter the team) [10:58:04] maybe a reprepro issue? [10:58:22] either a puppet or a repo issue probably [10:58:43] let me scan puppet for recent confd related patches [10:59:19] OK, possibly not me. Sorry, I've been working more on etcd, bootstrapping a new cluster. [10:59:39] okok [11:00:53] apt policy shows no package available on clouddb1016, but on cloudcephosd1025 it comes from http://apt.wikimedia.org/wikimedia buster-wikimedia/main amd64 Packages [11:01:41] is clouddb1016 buster or bullseye? [11:01:50] bullseye [11:01:56] (seems to be the main difference) [11:02:01] maybe a data point to pull from [11:02:23] pointing to something wrong on the repo, probably [11:03:56] yep, confd is not there https://apt-browser.toolforge.org/bullseye-wikimedia/main/ [11:04:33] it's a bit shorter than the buster one too, so maybe others are missing [11:06:08] I wonder if it was ever there, or it just didn't get upgraded yet? [11:06:33] hmm, so that would mean that puppet did not want to install it before? 
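The per-suite comparison above (apt policy on one host vs another, then eyeballing apt-browser) can be scripted against a Debian "Packages" index like the one apt.wikimedia.org serves. A hedged sketch — `pkg_in_index` is a made-up helper and the sample index below is fabricated for illustration, not real repo contents:

```shell
# Check whether a package name appears in a Debian Packages index read
# from stdin. -x matches the whole "Package: <name>" line, so "confd"
# does not accidentally match e.g. "confd-tools".
pkg_in_index() {
    grep -qx "Package: $1"
}

# Fabricated two-entry index standing in for e.g. buster-wikimedia/main
sample_index='Package: confd
Version: 0.16.0
Architecture: amd64

Package: prometheus-node-exporter
Version: 1.1.2'

if printf '%s\n' "$sample_index" | pkg_in_index confd; then
    echo "confd: present in this suite"
else
    echo "confd: MISSING from this suite"
fi
```

Running the same check against a bullseye-wikimedia index would have shown the package missing, which is exactly what the apt-browser link revealed.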
[11:07:29] I see no references to confd + bullseye [11:07:57] so my guess is you are the lucky first maintainer to upgrade a host to bullseye ? [11:08:12] with that role / package dependency [11:09:12] that part I cannot say [11:09:14] I think I remember someone mentioning that we don't yet have confd on bullseye, so if I wanted it I would have to package it. [11:09:14] there's many, there's 448 hosts failing to run puppet: https://alerts.wikimedia.org/?q=alertname%3Dpuppet%20last%20run [11:09:40] maybe it is a puppet regression then? [11:10:05] maybe, checked the logs for 'confd', the last diff with that word is from july 26th [11:10:28] backup1002 should absolutely not have a confd [11:11:15] so I am close to shutting down the puppet master now [11:12:03] jbond is fixing the confd stuff I believe [11:12:13] might be yes [11:12:18] Amir1: seemed to be talking to him in -operations [11:12:35] hmpf, I think I got the date wrong from the log [11:12:38] ack, will go there [11:12:38] yeah John is on it [11:12:51] -operations gets far too noisy [11:12:59] I wonder if confd is on the task to fix that [11:13:31] sorry all this was a change that should have just made it to sretest1001 but a missing `$` meant it went out everywhere [11:13:53] what's the patch? 
[11:13:56] ack, np [11:14:01] let us know if we can help [11:14:11] I added https://phabricator.wikimedia.org/T314118#8142383 for future as a follow up [11:14:20] jynus: let me fix the issue first [11:14:29] dcaro: I think I'm fine but thanks [11:15:21] dcaro: sorry, I assumed you *needed* confd for some reason there [11:16:03] I didn't realize you were confused why it was needed, not why it didn't install, which is what I was trying to debug [11:17:56] I wasn't sure if it was needed either, but I did not notice a clear addition of the package in the puppet logs (was looking for confd as package, not confd::file def) [11:18:51] so I guessed it was there already (actually saw confd running on some other hosts right before the alerts triggered, and was wondering why it was needed xd) [11:19:45] that reminds me, I should clean those up once the fix is in :) [11:20:55] I am going to open an incident doc [11:21:01] I think this is serious enough to require it [11:21:22] I take IC [11:22:12] 👍 [11:24:33] oh, and I should change the way the commits are displayed on my cli, as the date I got was the author date, not the commit date, that threw me off a bit too [11:25:03] dcaro: please add that to doc, that will be useful [11:25:09] okok [11:25:16] can you share the lin [11:25:19] *link? [11:25:33] (when you have it xd) [11:26:34] see ping [11:26:44] 👍 [14:41:14] kart_, urbanecm: andrewbogo.tt has dropped the old labweb hosts from dsh so the warnings about them in scap seen this morning should have gone. [14:41:24] amazing, thanks! [14:42:25] np! [14:51:00] RhinosF1: cool! [16:37:22] dhinus: likely related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/820248 [16:37:26] I'm running the decom for labweb1001 (with help from andrewbogott). The cookbook is showing a seemingly unrelated diff trying to remove gerrit2001 [16:37:57] Maybe mutante is running the decom script at the same time? 
[16:42:56] the cookbook is trying to update dns.git I think, and generated a commit with message "fnegri@cumin1001: labweb1001.wikimedia.org decommissioned, removing all IPs except the asset tag one" [16:43:58] but that commit seems to include a bunch of lines removed that are related to gerrit2001 [16:44:27] dhinus: that above patch (which is merged) means that I'm confident that it's not destructive for you to proceed. I'm definitely puzzled by how the cookbook is acting though. volans would be interested if he were here :) [16:45:04] dhinus: does the diff /also/ include the actual host you're decom'ing? [16:45:09] it does [16:45:30] ok, I think you should copy/paste all that into a bug and then 'go' [16:45:56] let's see if my screen-inside-tmux window lets me copy it :D [16:45:59] I predict that somewhere in california a terminal is waiting for daniel to type 'go' on the gerrit decom :) [16:50:45] jbond: I tried to run the decom cookbook but there were unexpected DNS changes [16:51:04] dhinus: ACK, same conflict for me the other way around [16:51:07] made me cancel [16:51:15] was trying to do a live demo of decom [16:51:47] I am unsure now if I should repeat my cookbook run.. but trying it [16:51:56] haha, what's the best way to untangle this? [16:52:38] I don't really know it either. so did you continue your run? [16:52:41] at the DNS step? [16:52:42] not yet [16:52:53] mutante: I suspect that if dhinus hits 'go' that it'll just clean up both? Unless that breaks the next run for you... [16:53:00] ok, so ... I _do_ want to remove gerrit2001 [16:53:41] andrewbogott: I already canceled mine. 
I guess it depends if there are other steps after the DNS change [16:54:02] I think dhinus should continue, then mutante should re-run [16:54:04] spicerack.remote.RemoteError: No hosts provided [16:54:08] During handling of the above exception, another exception occurred: [16:54:13] wmflib.interactive.AbortError: Confirmation manually aborted [16:54:19] andrewbogott: agreed [16:54:28] I'll hit "go" then [16:54:33] and then mutante's run may or may not work, depending on how pedantic the cookbook is :( [16:54:36] let's see what happens [16:54:37] dhinus: if the diff shows only labweb1001 and gerrit2001.. then yes, do it [16:54:42] yes, confirmed [16:55:51] my cookbook is proceeding with the following steps [16:56:00] where does that dns.git repo live by the way? [16:56:18] ha, ERROR: some step failed, check the task updates. [16:56:24] https://gerrit.wikimedia.org/r/admin/repos/operations/dns [16:56:26] dang :( [16:56:28] dns is operations/dns [16:57:04] dhinus: which step failed? [16:57:13] I'm trying to find it... [16:57:52] I don't see any obvious error lines, but it's quite a long output, I'm going through it [16:58:07] I [16:58:12] I'd expect it to be at the end [16:58:16] Unable to connect to the host, wipe of swraid, partition-table and filesystem signatures will not be performed: Cumin execution failed (exit_code=2) [16:58:25] See https://phabricator.wikimedia.org/T313861#8143291 [16:58:41] the netbox-generated parts live in https://phabricator.wikimedia.org/source/netbox-exported-dns/browse/master/ [16:58:42] right, just found the same comment [16:59:10] mutante: still needed? [16:59:19] jbond: yes, I think so [16:59:20] that failure sounds like it happened because the labweb* hosts have been with puppet disabled for too long [16:59:21] I think you can just leave a note for DC-Ops to wipe the disk when they do on-site steps [16:59:36] dhinus: that's weird but looks like victory to me. 
[16:59:42] thanks taavi, there's the commit with the "wrong" diff https://phabricator.wikimedia.org/rONED15cd621f2e045203888b4a0d93a7824a270ddf6f [16:59:46] but I'll start puppet on the other host so you don't hit this again [17:00:19] mutante: your turn :) [17:01:07] RhinosF1: I'd be hesitant to make such recommendations [17:01:29] Amir1: I'm pretty sure I remember it being done very recently [17:01:54] andrewbogott: trying to rerun cookbook, ACK! [17:02:41] RhinosF1: each case is different and there is the point of dcops being understaffed and having too many things to do while core SREs can take care of it [17:04:24] ==> ATTENTION: the query does not match any host in PuppetDB or failed [17:04:37] proceeds anyways [17:05:06] Found physical host [17:05:17] Host not found on Icinga, unable to downtime it [17:06:09] mutante: if the host has had puppet disabled for more than 2 weeks it won't be in puppetdb or icinga [17:06:13] it's asking me for mgmt pass now.. [17:06:17] powered off [17:07:01] Host gerrit2001.wikimedia.org already missing on Debmonitor [17:07:01] rephrase: if puppet hasn't run for more than two weeks it won't be in puppetdb/icinga [17:07:02] Removed from DebMonitor [17:07:25] it was removed from puppet in the previous run that I aborted because of the DNS conflict [17:07:32] Removed from Puppet master and PuppetDB [17:07:33] Sleeping for 3 minutes to get netbox caches in sync [17:08:48] ahh well yes that will do it as well :) [17:09:43] Generating the DNS records from Netbox data. It will take a couple of minutes. [17:10:14] Should the cookbook have a lock so 2 people don't conflict? [17:10:19] yes [17:10:34] and it should ignore the file site.pp when it checks for the hostname in repos [17:10:44] it should alert on all others, but not that one [17:10:54] Amir1: now that that server is powered down and purged from e.g. puppetdb, do you know how we can wipe the drives? Is there a mgmt> way to do it? 
[17:11:01] since you need to have it until after the cookbook run [17:11:13] andrewbogott: the decom cookbook wiped the drives.. in my case [17:11:38] mutante: right, but it didn't for labweb1001, hence my question. [17:11:46] It did everything else but left a warning about failing that part [17:11:55] the cookbook should take care of it [17:11:55] re locking, I know it's on vola.ns list not sure if there is a task [17:12:11] I see, Andrew. gotcha [17:12:22] Amir1: it failed for andrewbogott and dhinus [17:12:29] mutante: can you raise a task for the site.pp bit [17:12:40] RhinosF1: I know but we are running it again [17:12:44] Amir1: but it didn't, which is why RhinosF1 was suggesting that we ask dc-ops to do it, which is where you entered the conversation [17:12:49] jbond: ok [17:12:58] thx [17:12:59] so.. cookbook run finished as failed [17:13:04] I don't think it's idempotent, once it's finished it will say the server doesn't exist on a second run [17:13:07] DNS was ok: END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:13:12] END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts gerrit2001.wikimedia.org [17:13:13] andrewbogott: you can check what the cookbook does, it likely uses ipmi or redfish [17:13:43] Amir1: mutante is running theirs again for another host [17:13:54] oh I confused these two [17:13:55] jbond: ok -- I was assuming that Amir1 had done it by hand since that's what he was advocating :) [17:14:12] okay, let's try seeing what the cookbook does [17:15:06] https://github.com/wikimedia/operations-cookbooks/blob/82367ef847bee8ab67c988229865926c994e0f2d/cookbooks/sre/hosts/decommission.py#L311? [17:15:41] yes, /sbin/wipefs --all --force [17:15:43] that looks like it [17:16:03] I wonder if we can power it up and then simply run this [17:16:18] won't powering it up cause it to reappear in a bunch of places? netbox, icinga, ? 
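The wipe step being discussed (linked decommission.py code) can be sketched roughly as follows. This is a hedged reconstruction, not the cookbook's actual code: `wipe_disks` and the `DRY_RUN` guard are made up here, and the block echoes commands instead of executing anything, since the real commands are destructive:

```shell
# Sketch of the decom wipe: clear filesystem / RAID / partition-table
# signatures with wipefs, then zero the start of the disk so the host
# can't boot again. Never run the un-guarded version on a live disk.
wipe_disks() {
    for d in "$@"; do
        # ${DRY_RUN:+echo} expands to "echo" when DRY_RUN is set,
        # so the commands are printed rather than executed
        ${DRY_RUN:+echo} wipefs --all --force "$d"
        ${DRY_RUN:+echo} dd if=/dev/zero of="$d" bs=512 count=2048
    done
}

DRY_RUN=1
wipe_disks /dev/sda /dev/sdb
```

This also matches why the step needs SSH to the host: both commands run on the target, so "Unable to connect to the host" means no wipe happens.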
[17:16:22] I think the cookbook overwrites the boot sector [17:16:29] to prevent that [17:17:42] if it's powered off, it means it passed that point and wiped it off, right? [17:17:48] 'Wipe bootloaders to prevent it from booting again' [17:17:51] if that step happened [17:17:57] https://github.com/wikimedia/operations-cookbooks/blob/82367ef847bee8ab67c988229865926c994e0f2d/cookbooks/sre/hosts/decommission.py#L330 [17:18:14] mutante: I'm a bit confused about the issue with gerrit2001. from https://phabricator.wikimedia.org/T243027#8143239 it looks like the decom worked fine, which explains why the next attempt failed [17:19:16] jbond: I assume running it twice means this is normal behaviour [17:19:22] first time it does some things [17:19:28] then we get to the DNS part..it asks me to check [17:19:30] I say NO [17:19:34] jbond: it got aborted at the remove from dns stage because dhinus and mutante stepped on each other's toes [17:19:42] (because of the conflict with another user running it at the same time) [17:19:51] I run it again.. it does some remaining steps [17:20:02] but only after other stuff "failed" because it was already done [17:20:37] I still think it was right to abort though because that is the point of checking the DNS diff [17:22:01] mutante: I think you were right to abort as well [17:22:23] and thanks for the explanation [17:22:25] maybe we could've aborted _both_ runs, but I'm not sure that would have made things easier? [17:22:33] If chaos ensues any time two people decom at the same time we need a lockfile [17:22:50] I think it would be nice if https://phabricator.wikimedia.org/T243027#8143239 somehow showed you aborted and added the steps that were skipped [17:22:55] jbond: so.. the host is gone from DNS, I can't ssh to it anymore.. I am removing it from site.pp now.. in netbox it's in status "decommissioning"... and mgmt still exists [17:22:59] does that sound right? 
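The lockfile idea raised here ("If chaos ensues any time two people decom at the same time we need a lockfile") is the classic single-host pattern with flock(1). A hedged sketch — the lock path and `run_locked` wrapper are illustrative, not what spicerack actually does:

```shell
# Serialize cookbook runs: only one holder of the lock file at a time.
LOCK=${TMPDIR:-/tmp}/decom-demo.lock

run_locked() {
    # -n: fail immediately instead of queueing, so the second operator
    # sees the conflict right away rather than silently waiting
    flock -n "$LOCK" -c "$1"
}

run_locked 'echo "got the lock, running cookbook steps"'
```

A real fix for concurrent cookbook runs would need a lock shared across cumin hosts (flock only protects one machine), which is presumably why this was on the wishlist rather than a quick patch.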
[17:23:24] mutante: yes that all sounds right to me [17:23:35] ok, then I will consider it resolved now. thank you [17:23:46] also looking at the output from both tasks I think everything ended up getting completed with the two runs combined [17:23:54] sgtm, no problem [17:24:35] andrewbogott: dhinus: yea, so from my side it's over. how about labweb1001? similar to that above ^? [17:25:01] I think the script worked fine for labweb1001 except for wiping the disk [17:25:23] I mean, /not/ wiping the disk [17:25:32] ACK, I think this means we have to ask dcops to wipe it [17:26:07] you could try booting it via mgmt but if we can see in logs it wiped the boot sector... there is little point I suppose [17:26:22] can I proceed with labweb1002 in the meantime? [17:26:42] yes [17:26:46] I am not going to run another cookbook. go ahead [17:42:52] END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts labweb1002.wikimedia.org [17:43:34] cool:) [17:49:43] andrewbogott: I'm going to merge the cleanup patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/817386/1 [17:50:02] ok! [17:56:13] done. can you take care of reassigning to dc-ops and mentioning they will need to wipe the disk? [17:56:40] yep! [18:00:21] thanks, and thanks everybody for the help :) [20:46:11] Hello team, do you know if there's a way to run puppet in an instance and have it 'reset' to its original state? For ex. If folder permissions were changed I'd like puppet to revert everything as described in the manifest. [20:46:28] '# run-puppet-agent' does not seem to implement that behavior... [20:51:22] run-puppet-agent ought to do that, but the logic around what's actually enforced by puppet, and what isn't, can be... surprising :) [20:51:30] can you point me at what directory you're looking at? [20:52:31] rzl: Yes, it's the '/var/lib/rancid/' directory in the 'netmon1003' instance. [20:52:52] looking -- where's it set up in the puppet repo? 
[20:54:20] It's in 'modules/rancid/manifests/init.pp'. Thanks in advance. :D [20:56:51] this is one of those things where there's a trick to it, but I forget what it is exactly -- one sec while I try to find it again [20:58:10] (often the problem is that you didn't say "recurse => true" when you meant to, but that doesn't look like what you want here) [21:01:17] hm, I see that you have managehome => true on the user and then you also have the directory as a separate file resource, I wonder if that's getting tangled [21:02:49] I see files within /var/lib/rancid are defined in puppet but not the /var/lib/rancid itself? [21:03:02] /var/lib/rancid/core is [21:04:11] doh thanks you're right, I was seeing /var/log/rancid [21:04:28] okay then yes, mutante's much simpler explanation is correct, thanks :) [21:04:58] denisse|m: looks like you would have to add that directory.. but since it isn't now.. maybe the easiest fix is to compare netmon1002 and netmon2001 and manually fix it [21:05:28] it does look the same though.. [21:05:37] mutante: it's listed as the rancid user's homedir [21:05:47] So maybe that was auto creating and never managing after [21:06:23] found: "This parameter has no effect unless Puppet is also creating or removing the user in the resource at the same time. " [21:06:29] https://puppet.com/docs/puppet/7/types/user.html#user-attribute-managehome [21:06:35] hmmm [21:08:04] the way I read that is.. if the user already existed when somebody added the "managehome" line then it might not do anything to add it [21:08:38] and it only says it will "create" and "delete" but does not say if it will enforce any permissions [21:08:40] Another question team, I see the following alert:... 
(full message at https://libera.ems.host/_matrix/media/r0/download/libera.chat/881fa59e06d41462c932e04abdb8434ea9c45299) [21:09:00] yeah I think that's correct, I saw somewhere that it just means it runs useradd -m [21:09:25] so if you actually care about the directory permissions or other details, you'd want to manage it with a file resource [21:09:41] mutante: re /var/lib/rancid: That makes sense. So it doesn't change the permissions because they're not explicitly defined in the puppet manifest, right? [21:10:04] denisse|m: yea, because there is no file{} section for the dir itself, only for things inside it [21:10:48] Thanks for the help team! :D [21:10:53] re: keyholder.. "sudo keyholder status" on netmon1003 says "no identities" [21:10:59] so something there failed [21:11:08] even though all you say looks right to me in that paste [21:12:44] denisse|m: where did you get the passphrase from? [21:13:29] From the pw repository: ruby pws.rb ed ~/Wikimedia/pw/network-monitoring-keys-passphrase [21:13:53] thanks, I was looking for rancid* or keyholder* in there [21:13:55] 'sudo keyholder status' shows it as active now. [21:13:56] let me try to arm it [21:14:08] wow, indeed [21:14:16] something loads and unloads it?! [21:14:26] The thing is that the changes do not seem to be persistent. [21:14:33] I could still make a screenshot of it both not being loaded and being loaded [21:15:02] Oh, I just ran the '# keyholder arm' process, that's why it shows it's active now. [21:15:16] But I ran the same process several times hehe. [21:15:19] I'm not sure how to make the changes persistent. [21:15:20] ok, but you already did this a couple times before, right? [21:15:33] when does it disappear again? [21:15:34] Yes, I did it about 3 times. [21:15:38] after the next puppet run? [21:15:57] this definitely counts as "weird" already [21:16:13] I haven't seen this with deploy1002 afair [21:16:17] That's an excellent question, I'm not sure yet. 
I'll be paying attention as to what triggers it as it's unexpected behavior. [21:16:47] I also think it may be related to this issue which topranks is helping me to debug. :D https://phabricator.wikimedia.org/T314936 [21:17:12] I think it looks like that issue is solved - just dm'd you [21:18:04] so you originally asked this because of the "Could not create directory '/var/lib/rancid/.ssh' (Permission denied)." [21:18:08] When systemd runs the command it exports an env variable for SSH_AUTH_SOCK, after which passwordless ssh works [21:18:23] So I think you may have fixed it changing the permissions for /var/lib/rancid dir? [21:18:53] when I tried that command "sudo -u rancid jlogin -c "show version" cr1-eqiad.wikimedia.org" I get a Password: prompt [21:18:58] keyholder is armed [21:19:08] mutante: yeah, same for us [21:20:03] Could it be that the password belongs to the 'rancid' user and not to the servers? [21:22:06] If you run a shell as the rancid user, export that env var, then the jlogin command works [21:22:11] https://www.irccloud.com/pastebin/vkJ6HEKF/ [21:22:27] The systemd unit that runs rancid sets this before it executes. So all is working [21:22:48] sudo -u rancid SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/rancid cr1-eqiad.wikimedia.org [21:22:51] this works too [21:23:15] I believe the fix is that /var/lib/rancid is now owned by "rancid" user not root, so the ssh is not bombing out trying to write to that dir [21:23:49] mutante: cool, I was trying to do that myself but the syntax proved elusive :) [21:24:19] on netmon1002 it is root:root but on netmon1003 it is rancid:rancid [21:24:21] for the rancid home [21:24:44] topranks: Cool! I changed the directory's owner to test if it worked, glad it did. I'll add it to the puppet repository. 
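The puppet-side fix being promised here is an explicit file resource for the directory, since (as discussed above) managehome only acts when the user is created. A hypothetical sketch, not the merged patch — the mode is assumed:

```puppet
# Hypothetical sketch: manage the rancid home directory explicitly so
# puppet enforces ownership on every run, rather than relying on
# managehome's create-time-only behaviour.
file { '/var/lib/rancid':
    ensure => directory,
    owner  => 'rancid',
    group  => 'rancid',
    mode   => '0750',  # assumed; match whatever the role expects
}
```

With this in place, a manual chown back to root:root would be reverted on the next agent run, which is the "reset to the manifest" behaviour asked about earlier.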
[21:25:01] topranks: lucky to find it at https://wikitech.wikimedia.org/wiki/Keyholder#Hints [21:25:08] yeah I think that is the fix [21:25:13] agreed [21:25:25] I assume we are just installing from upstream debian package ? [21:25:37] Which is creating the systemd service and timer? [21:25:46] Or are we adding those ourselves? [21:25:50] mutante: That's correct. However, topranks made a very good observation in that in the 'netmon1002' instance 'rancid' runs as a cron job which is run as the 'root' user and in the 'netmon1003' instance 'rancid' runs as a systemd service that runs as the 'rancid' user. [21:25:59] Seems the difference from netmon1002 is that it was a cronjob, running as root, before. [21:26:08] It's not a systemd unit, but set to run as user 'rancid' [21:26:22] haha... yep that :) [21:26:23] we are creating systemd timer jobs "rancid-differ" [21:26:30] ok cool [21:26:48] well no need to upstream any fix, we just need to set the owner of the dir correctly to match the systemd unit [21:26:52] it all makes sense now [21:26:57] because crons were migrated to timers [21:27:47] cool cool. yep all makes sense in the end :) [21:27:51] yes, it has "user => 'rancid'" for the timer and before it was root then [21:27:54] denisse|m: ACK :) [21:28:14] I'll send a small puppet patch that fixes the directory's owner for when it's a Debian Bullseye instance so we won't have this issue in the future. ^^ [21:28:14] Thanks a lot for your help and for sharing your insights!! <3 [21:28:41] blame https://phabricator.wikimedia.org/T273673 :) [21:29:14] Timers are much nicer than crons because of the monitoring [21:29:17] denisse|m: it's actually my fault [21:29:18] https://gerrit.wikimedia.org/r/c/operations/puppet/+/721854/3/modules/rancid/manifests/init.pp [21:29:23] And it's easier to see last run / next run [21:29:42] wait.. it said "rancid" before too...but yea [21:30:20] RhinosF1: Agreed, I was skeptical at first but I prefer systemd timers. 
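The cron-to-timer migration being blamed here produces units roughly of this shape. A hedged sketch only — the unit names follow the "rancid-differ" job mentioned above, but the ExecStart path and schedule are assumptions, not the deployed config:

```ini
# rancid-differ.service (sketch; path and schedule are assumed)
[Service]
Type=oneshot
# User= is the bit the migration changed: the job now runs as the
# rancid user, where the old crontab entry ran it as root
User=rancid
ExecStart=/usr/bin/rancid-run

# rancid-differ.timer
[Timer]
OnCalendar=hourly
```

The ownership bug follows directly: once the job stopped running as root, any directory still owned by root:root in the rancid home became unwritable to it.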
[21:30:32] there must have been another change before that that changed it from root to rancid I guess [21:30:49] topranks: the one thing I never liked about crons was how invisible they were [21:31:03] Until someone moaned a while later that something wasn't working [21:31:16] exactly [21:31:26] mutante: yeah that is odd [21:31:39] at least not visible to the person creating them.. then very visible for root@ [21:32:04] mutante: well ye if you allow root@ spam then they are [21:32:09] topranks: it is. But I think we can agree "root:root" is not intended [21:32:29] mutante: checking again the cronjob was actually configured to run as user rancid [21:32:35] Native systemd is much cleaner imo [21:32:51] But somehow on netmon1002 that didn't cause a problem with /var/lib/rancid owned by root:root [21:32:55] but we can probably still blame the cron->timer change [21:33:14] not 100% satisfied though :) [21:33:41] Yeah. On netmon1002 there is no directory "/var/lib/rancid/.ssh" [21:33:58] It was netmon1003 trying to create that that was failing and causing the problem [21:34:00] because only the "active_server" gets it? [21:34:20] I checked /etc/ssh/ssh_config to see if host key saving/checking was disabled or anything but they are both the same. [21:34:34] I do expect it's some quirk in how SSH is being invoked due to change from cron to timer [21:39:18] Ok, think I see the difference in behaviour anyway. [21:39:24] https://phabricator.wikimedia.org/T314936#8144068 [21:40:04] Seems it was previously failing to create the directory, but jlogin / ssh proceeded afterwards, and on newer box it bombed out when it couldn't create the ".ssh" dir [21:41:26] I'll have a chat with Arzhel tomorrow on this. [21:41:31] ok, thanks topranks. then it should be the ssh version changes in bullseye [21:41:48] Perhaps we don't want to save the host keys, as it might fail if we replace a network device. 
But obviously it's probably a good thing to store them, in case something nefarious occurs [21:42:46] mutante: yes quite likely [22:01:21] ssh 7.9: [22:01:38] if (mkdir(buf, 0700) < 0) [22:01:38] error("Could not create directory '%.200s'.", [22:01:41] buf); [22:01:48] ssh 8.4: [22:02:13] if (mkdir(dotsshdir, 0700) == -1) [22:02:13] error("Could not create directory '%.200s' (%s).", [22:02:16] dotsshdir, strerror(errno)); [22:02:54] (git clone https://salsa.debian.org/ssh-team/openssh ; git checkout bullseye vs git checkout buster) [22:03:33] previously this was in ssh.c and now it's in hostfile.c [22:05:04] hmm interesting. I note we only got that precise message on the bullseye host [22:05:39] Error logged, but didn't stop process, on buster was "Failed to add the host to the list of known hosts" [22:05:40] < 0 vs == 1 [22:06:09] == -1 [22:06:35] right.. [22:06:57] but yeah, if it's equal to -1 now should match <0 before, but I'm way out of my depth here :) [22:07:13] the pre-bullseye version does not have the "dotsshdir, strerror(errno));" part [22:07:26] I am satisfied enough now though :p [22:07:39] the whole code about that is in a different place now [22:07:57] haha yeah we're quite far down the rabbit hole. [22:08:13] I think we can be confident that a change in there has caused the different behaviour [22:08:14] confident enough to blame ssh and move on now :) [22:08:36] And likely we always should have had the ownership of that dir set correctly