[09:41:22] I'm disabling puppet in eqiad for ~5 mins
[09:50:53] all back on
[13:00:49] <_joe_> taavi: do we override $deployment_group in beta?
[13:00:58] <_joe_> sorry, it's a hiera label
[13:01:02] <_joe_> 'deployment_group'
[13:05:32] _joe_: not according to https://codesearch.wmcloud.org/puppet/?q=deployment_group&i=nope&files=&excludeFiles=&repos=, which indexes ops/puppet.git and hiera configured via horizon
[13:06:05] although I'm not very up to date on how deployment-prep does things
[13:06:11] <_joe_> yeah I was searching in horizon too
[13:06:22] <_joe_> taavi: I will verify by ssh'ing into the server :P
[13:06:32] <_joe_> that's much safer than assuming things from puppet/hiera
[13:42:55] _joe_: if you are not already aware you can git clone ssh://gerrit.wikimedia.org:29418/cloud/instance-puppet and grep in there for horizon config
[13:43:32] <_joe_> yeah I am, I wasn't sure of the name of the repo anymore
[13:54:13] Has anyone seen this type of error before? Looks to me like ECC RAM just doing its job, but wondering if it's an early sign of memory failure, maybe? https://phabricator.wikimedia.org/T308647
[13:56:20] inflatador: have you checked the logs directly on idrac?
https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Show_logs if maybe they give any more insight
[13:56:21] inflatador: yeah, I have seen that before in our databases
[13:56:31] volans: I did, there was nothing
[13:56:37] ack
[13:57:01] inflatador: normally what dc ops do is swap the DIMM with another DIMM and see if it fails again, either the same dimm or the slot itself
[13:57:17] basically exchange existing dimms
[13:58:08] works best with 3 DIMMs though :-P https://en.wikipedia.org/wiki/Three-card_Monte
[14:01:46] I think usually, when you get error logs about single-bit errors, it's because it's seen multiple of them in the same "location" (dimm- or chip-level, or row)
[14:02:01] so it's a bit above the ECC just doing its everyday job and correcting one-offs
[14:02:31] once it's happening in a pattern, it's often and early-warning sign of worse failure
[14:02:36] s/and/an/
[14:03:05] but yeah, there can be lots of causes, and it's a reasonable step to try re-seating and/or moving to diagnose and maybe-fix if you're lucky
[14:09:43] ah thanks all! (esp marostegui for posting in the ticket)
[14:41:35] <_joe_> I'm about to take over the deployment servers; during a brief interval, you won't be able to deploy mediawiki. Anyone against it?
[14:51:17] <_joe_> ok, locked the deployment server
[19:22:18] * jbond just realised i broke my Git/Reviewers regex which explains why i was on so many additional reviews today :D
[20:21:55] !log ganeti5003 updating firmware via T308211
[20:21:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:00] T308211: Upgrade ganeti/eqsin to Bullseye - https://phabricator.wikimedia.org/T308211
[21:59:38] mutante: denisse and I have been testing and we've found out that we're able to connect to mw2419.mgmt.codfw.wmnet on Debian oldstable and Debian stable using the exact same config from Arch.
However, Debian *testing* is failing
[22:00:52] I'd think it would be the openssh 8.x → 9.x update but manually installing 8.x on my arch machine still fails - perhaps it's a dep that's the issue?
[22:00:58] We're still digging :)
[22:07:35] I can't find anything relevant in the openssh v9.0 release notes. https://www.openssh.com/releasenotes.html
[22:17:00] isn't that https://wikitech.wikimedia.org/wiki/Management_Interfaces#Does_IPMI_work_but_SSH_to_the_management_console_doesn%27t%3F ?
[22:22:40] paravoid: Thanks but the problem is that we can't login using a password. For some reason openssh is defaulting to use a publickey even when the configuration specifies a password.
[22:23:13] This is only present with the servers in the domain brett mentioned.
[22:23:35] We can login with a publickey and our shell usernames in the other servers.
[22:24:54] 8.9 on Arch didn't work either, so it's *possible* that the issue lies in a change between 8.4 (Debian stable) and 8.9 (the lowest release number I've been able to test on)
[22:25:37] what does ssh -vv say?
[22:29:06] QQ, do we have a preferred way of sharing text snippets like the output of ssh -vv?
[22:29:31] most use https://phabricator.wikimedia.org/paste/
[22:29:36] Thank you!
[22:29:45] (you can also set permissions to your pastes to NDA/wmf-only etc.)
[22:31:04] https://paste.debian.net/plainh/f95541d6
[22:31:12] phab is noted
[22:31:12] https://phabricator.wikimedia.org/P28158
[22:36:57] can you try with "ssh -o KbdInteractiveAuthentication=yes"?
[22:38:01] if that doesn't work, then try with "ssh -vvv" (3 "v", so you get "debug3: ..." lines)
[22:38:26] * denisse hides
[22:38:31] paravoid: That works.
[22:38:49] awesome :)
[22:38:51] brett and I were debugging it when the issue was very simple.
[22:38:58] Thank you very much for your help! :D
[22:39:06] the question is why that is disabled -- it defaults to yes
[22:41:09] wtf
[22:41:19] yeah, that works ._.
[22:42:19] 8.6 does it, it seems: https://bugzilla.mindrot.org/show_bug.cgi?id=3303
[22:42:49] wow, just saw this. very interesting. it was very odd that the same issue happened to both of you and you had in common that you are Arch users
[22:43:07] Remember that it doesn't work on Debian testing either!
[22:43:13] * denisse starts adding this to the Wiki
[22:43:26] ack! it's just the newer version, ok
[22:44:01] I'm glad we found the issue early. It was going to break once Debian rolls out the new openssh versions.
[22:44:06] and you can both decrypt the "management" file in pwstore and use it to login on DRACs now?
[22:44:25] I don't think we've gotten to the bottom of this yet, fwiw :)
[22:44:27] Yep
[22:44:28] yea, good point denisse
[22:44:55] ssh_config(5) in openssh 9.0 still says that KbdInteractiveAuthentication defaults to yes
[22:45:16] Oh dang, I didn't read that right. Hm.
[22:45:29] Ah yeah, ChallengeResponseAuthentication was removed in 8.6 it seems
[22:47:00] you saw the other ticket they mention? https://bugzilla.mindrot.org/show_bug.cgi?id=2408
[22:47:08] that seems to be the PAM part
[22:48:26] mutante: Yes, I can decrypt the "management" password and login.
[22:50:26] denisse: great :) here is more info what you can do there, btw https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Common_Actions
[22:50:48] Thanks for your help on this, mutante :)
[22:50:54] (if it's a Dell, only a few special ones are HP)
[22:50:55] paravoid: You're right, the problem is that brett and I took our SSH configs from here: https://wikitech.wikimedia.org/wiki/SRE/Production_access#Setting_up_your_SSH_config
[22:51:17] And that config has "KbdInteractiveAuthentication no".
[22:51:35] holy shit you're right. How did we not see that? :O
[22:52:34] aha! worth checking if that is also in the sre-laptop package. I don't have it in my config but used neither
[22:52:54] I just checked and the sre-laptop package has a different configuration.
[22:53:36] don't remove it entirely but add another section for "mgmt" hosts that sets it to yes?
[22:54:00] I think it would be good to have a single source of truth for this so removing that config from the Wiki and adding a link to the sre-laptop package could be a good idea. What do you think?
[22:54:11] mutante: That sounds good, thank you!
[22:54:21] or we could copy the config from sre-laptop to the wiki example .. yea
[22:54:58] That's a good idea but I think it could cause us trouble in the future as we may forget to update the config in both places.
[22:55:02] your idea is probably better because it can't unsync
[22:55:36] maybe that is a candidate repo to import into gitlab
[22:56:01] Ah, I'll just add a note that the most up to date version of that file is in the wmf-sre-laptop package and that the config shown in the Wiki is just an example. :)
[22:57:43] sounds good. and/or you could add something like "warning, this will break with openssh 9.0 because..."
[22:57:49] * denisse wonders if a code snippet from gerrit can be inserted into mediawiki
[22:59:53] I don't think you can directly include it. that would need an MW extension.
You can link to this though
[22:59:56] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/wmf-sre-laptop/+/refs/heads/master/configs/ssh-client-config
[22:59:59] this is "gitiles"
[23:00:00] I just tried with 9.0p1-1 and can confirm it works. The issue was the openssh config.
[23:00:45] Thanks! That link would work.
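
For reference, a minimal sketch of the per-host override suggested at [22:53:36]. It is not the contents of the wiki page or of the wmf-sre-laptop package. The `*.mgmt.*.wmnet` pattern is an assumption derived from the mw2419.mgmt.codfw.wmnet hostname, and placing the existing "KbdInteractiveAuthentication no" under `Host *` is also assumed. ssh_config uses the first value it obtains for each option, so the more specific block has to come first.

```
# Hypothetical pattern for management interfaces; keep this block
# above any broader Host block that sets the same option.
Host *.mgmt.*.wmnet
    KbdInteractiveAuthentication yes

# Assumed placement of the existing setting for all other hosts.
Host *
    KbdInteractiveAuthentication no
```

Running `ssh -G mw2419.mgmt.codfw.wmnet | grep -i kbdinteractive` prints the effective value for that host without connecting, which is a quick way to check that the override is picked up.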