[06:22:52] jbond: there are puppet changes pending from you
[09:31:09] FYI I've moved the "prometheus job unavailable" alerts from icinga to alertmanager just now, the outstanding alerts will fire again
[09:44:42] Q about decommissioning hosts. I've put in https://gerrit.wikimedia.org/r/c/operations/puppet/+/761283 to remove the hosts from puppet (which includes taking them out of conftool-data), and I've already depooled the target hosts. Are there other steps necessary before running the sre.hosts.decommission cookbook? The "remove from production" wikitech page has a note 'Remove from pybal/LVS (if applicable) - see the sre.hosts.reimage
[09:44:42] cookbook option -c/--conftool and consult the LVS page'
[09:46:16] running puppet-merge looks to have a done the conftool cleanup
[09:48:19] and e.g. 'confctl select dc=codfw,cluster=swift get' no longer lists the old hosts, so I think I'm OK, I'd just like a grownup to confirm I'm not missing something obvious :)
[09:52:28] I think it is ok to proceed, if the hosts are not getting any traffic you are good to go (and puppet is already cleaned up)
[09:58:00] yeah, that's all you need on the conftool side, you can proceed with the decom
[10:00:48] thanks
[10:02:16] masorry about that looks like its been merged now
[10:03:10] marostegui: was ment for you most have missed hitting tab :)
[10:03:22] jbond: yes, I merged it
[10:03:29] gretatthanks
[10:03:33] np!
[10:22:32] Emperor: FYI ms-fe2005 did sent a bunch of cron-spam to root@
[10:23:43] and still sending, various per minute
[10:24:12] 8/minute AFAICT
[10:24:51] I notice I need to move the stats reporter, and probably also restart swift-proxy on those old nodes
[10:26:05] stats reporter> https://gerrit.wikimedia.org/r/c/operations/puppet/+/761293
[10:26:39] [I have updated the swift/howto to add a note that this needs moving also]
[10:28:25] godog: you OK to +1 that CR so I can move the stats reporter to ms-fe2009 please? Then I can try and shut up ms-fe2005
[10:29:38] Emperor: yeah LGTM
[10:30:16] ah this is the cronspam? awesome, I will be happy to see it squelched
[10:30:26] sorry!
[10:30:42] as long as it's being worked on, all good
[10:32:35] puppet run on old and new node, so the crons/timers should have moved over
[10:32:51] \o/
[10:33:46] have also noted the need to move this when decommissioning swift proxies in future...
[10:55:39] I'm mildly confused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/761294 since I was expecting moving prometheus_nodes to role-based hiera to work, looks like it doesn't ? https://puppet-compiler.wmflabs.org/pcc-worker1003/33641/cloudcephmon1003.eqiad.wmnet/index.html
[11:22:16] <_joe_> Emperor: sorry I was still knee deep in spicerack's code, you did everything correctly
[11:40:07] what are you saying, wading through all that mud? ;p
[11:45:28] <_joe_> question_mark: it is painful, but to be fair it's a certain satisfaction when dodgy, mccabe, pep257, pep8, profile-validator, pyflakes, pylint, pyroma, vulture say your code is ok
[11:45:48] <_joe_> (that's just from prospector, of course we have more linters/checkers)
[11:50:46] * volans not taking the bait
[11:51:03] * volans reviewing the code instead... wait for the volint
[11:52:52] the .8 volfloat
[12:01:51] <_joe_> volans: I would prefer if you concentrated on the code structure first, given this is more or less a POC
[12:02:20] <_joe_> I've implemented basically the libraries for one out of all the cookbooks we plan to write
[12:02:22] _joe_: for that I would need first a crash course in k8s api
[12:02:35] <_joe_> sure!
[12:02:55] <_joe_> AMA
[12:03:27] mandatory: https://www.amaroma.it/images-new/assets/loghi/logo-amaroma-home.png
[12:05:17] <_joe_> ahahaha
[12:05:28] <_joe_> noone else will get how appropriat it is
[12:05:31] <_joe_> *e
[12:13:45] _joe_: I think that the general structure seems ok for now, I really like the Spicerack.kubernetes() accessor implementation :-P
[12:14:42] <_joe_> heh yeah it's not like there's much to configure
[12:15:30] <_joe_> oh heh commit fail I guess :D
[12:15:44] <_joe_> it was suppsoed to just return an instance of the Kubernetes class :P
[12:15:56] yeah I know :-P
[12:16:16] btw I hate that gerrit gives me the option to publish either all or none of my comments
[12:16:24] I would like to publish just some of them
[12:16:45] <_joe_> add [volint] in front of the others
[12:16:47] <_joe_> :D
[12:21:23] {done} all minors shouild have a [...]
[12:21:26] prefix
[12:26:11] can I have a look at that code? cc _joe_ volans
[12:26:44] not for review, for to learn from the implementation details
[12:28:09] s/for/but/
[12:29:57] <_joe_> arturo: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/761297
[12:30:08] excellent
[12:30:13] thanks
[12:30:21] <_joe_> if you see something wrong, please comment!
[12:30:30] can someone explain why I'm seeing "ERROR profile::prometheus::ops not in autoload module layout (autoloader_layout)" on https://gerrit.wikimedia.org/r/c/operations/puppet/+/761313/
[12:31:02] ah.. probably a typo in the hiera key :/
[12:31:29] yeah, some kind on unrelated code causing syntax error
[12:31:34] (probably)
[12:41:47] taavi: i just hit that my self can you try a rebase and see if it fixes it
[12:42:02] jbond: fixed itself on a second PS
[12:42:12] ci should have that check disabled both explicitly and by puppetlabs/spec_helper_rake_tasts so its a wiered one
[12:42:15] ack
[12:42:28] but i did make changes here yesterday so...
[12:56:50] jbond: I'm also getting a "Whoops! It looks like puppet-lint has encountered an error that it doesn't know how to handle." on https://gerrit.wikimedia.org/r/c/operations/puppet/+/761315/ which persists with a recheck
[13:05:26] taavi: ack looking now
[13:31:19] taaivi should be good now
[13:31:25] taavi: even :)
[13:31:33] thx!
[13:32:55] sigh, still scratching my head on why https://gerrit.wikimedia.org/r/c/operations/puppet/+/761294 doesn't seem to apply the role hiera, at least according to pcc https://puppet-compiler.wmflabs.org/pcc-worker1003/33641/cloudcephmon1003.eqiad.wmnet/index.html
[13:34:24] my understanding is that it should work as intended, i.e. the role hiera has precedence over e.g. hieradata/eqiad.yaml
[13:41:58] or likely my understanding is wrong heh
[13:43:25] godog: i just checked and the site layer of hiera takes precendence over the role/site. personally i would say this is not ideal but its been this way for some time https://gerrit.wikimedia.org/r/c/operations/puppet/+/566559/60/modules/puppetmaster/files/production.hiera.yaml
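What jbond describes at 13:43 is hiera's usual first-match lookup over an ordered list of layers: a key set in a site-wide file shadows the same key in the role/site layer whenever the site layer is listed higher in the hierarchy, which would explain why godog's role-hiera change came out as a no-op in PCC. A toy Python illustration of that lookup behaviour, assuming a made-up hierarchy and values (the real layer order is in the linked production.hiera.yaml):

    # Toy illustration of hiera-style "first match wins" lookup.
    # Layer paths, hostnames and values below are invented for illustration;
    # they are not the actual production hierarchy or data.
    from typing import Any, Optional

    # Ordered highest priority first: the site-wide layer sits above the
    # role/site layer, so it shadows it.
    HIERARCHY = [
        ("hieradata/eqiad.yaml", {"prometheus_nodes": ["prometheus1003.example"]}),
        ("hieradata/role/eqiad/ceph/mon.yaml", {"prometheus_nodes": ["prometheus1004.example"]}),
        ("hieradata/common.yaml", {}),
    ]

    def lookup(key: str) -> Optional[Any]:
        """Return the value from the first layer that defines the key."""
        for _name, data in HIERARCHY:
            if key in data:
                return data[key]
        return None

    # The site layer wins even though the role layer also defines the key.
    assert lookup("prometheus_nodes") == ["prometheus1003.example"]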
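For context on the Spicerack.kubernetes() accessor joked about around 12:15 above: the shape under review is just a factory method on the Spicerack entry point that hands a cookbook a ready-made wrapper object. A hypothetical sketch of that pattern only, with invented class and argument names rather than the actual code in the linked change:

    # Hypothetical sketch of the accessor pattern discussed above; the real
    # implementation is in the operations/software/spicerack change linked
    # at 12:29:57. All names here are invented for illustration.
    class Kubernetes:
        """Illustrative thin wrapper around a Kubernetes cluster API."""

        def __init__(self, group: str, cluster: str):
            self.group = group
            self.cluster = cluster

    class Spicerack:
        """Stripped-down stand-in for spicerack's accessor-based entry point."""

        def kubernetes(self, group: str, cluster: str) -> Kubernetes:
            # As _joe_ notes, there is not much to configure: the accessor
            # just returns an instance of the Kubernetes class.
            return Kubernetes(group, cluster)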
[13:45:12] jbond: thank you for double checking, yeah I didn't expect it either heh
[13:45:35] that's a pandora's box I don't want to open atm though
[13:55:29] yes exactly :)
[13:56:01] the public vs private prefrence is the wrong way round imo as well but thats likley a bigger can of worms
[13:56:39] * godog sighs in hiera
[13:57:16] godog: surely you mean sighs::in:hiera :P
[13:57:30] haha! well played sir
[14:17:07] godog: fyi: https://phabricator.wikimedia.org/T301349 has changes allready attached. however reating the CR's is the easy bit here and testing is the tricky bit
[14:50:54] jbond: oohh very nice, thank you for kickstarting that
[14:53:51] fyi pcc for the role change came up noop so im tempted to deploy if no one shoots https://puppet-compiler.wmflabs.org/pcc-worker1002/33656/
[14:54:14] shouts (or shoots i gusse :))
[14:54:34] lol
[14:55:05] the pcc result is encouraging for sure, can't be certain one way or another (and have to jump in a meeting in five)
[20:52:32] taavi or anyone else, small PR for adding myself and ryankemper to the deployment-prep authorized keys: https://gerrit.wikimedia.org/r/c/labs/private/+/761465/1/modules/passwords/templates/root-authorized-keys.erb TIA~
[20:56:05] inflatador: yeah... that does not do what you wanted to do. See my comment on where you really need to add these keys
[20:57:03] :eyes
[20:59:11] inflatador: sort of related: what do you mean by "login locally" on https://phabricator.wikimedia.org/T299797#7691081?
[21:11:06] taavi bd808 I'm planning on meeting with ryankemper in about 20m (2130 UTC) . If y'all have time to join, would love your help in getting unblocked on this
[21:14:29] inflatador: what specifically do you need help with? I'm not really understanding the problem statement in T301408 or finding more details in T299797
[21:14:30] T301408: Enable search team SRE access to deployment-prep VMs - https://phabricator.wikimedia.org/T301408
[21:14:30] T299797: Deploy new elastic cluster nodes on deployment-prep - https://phabricator.wikimedia.org/T299797
[21:16:38] bd808 I can't login. Will try to add the keys in as you suggested in your comment and get back to you
[21:17:27] inflatador: which host are you failing to login to? deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud?
[21:18:49] The task mentions deployment-elastic00 but there does not seem to be any instance with that name in the project
[21:19:18] yeah, any new VMs running bullseye. I will update https://phabricator.wikimedia.org/T299797 to make it more clear, but basically we need to test our elasticsearch deploys on bullseye
[21:21:52] inflatador: the auth.log on deployment-elastic11 is full of things like "ailed publickey for bking from 172.16.5.8 port 43492 ssh2: ED25519 SHA256:iZFGqiIXaMe6xnjlXEBt2KRV3beFC4TqmTxCnXJKIZU"
[21:22:29] are you maybe not sending the expected key? Like you prod key instead of your WMCS key?
[21:22:50] bd808 99% sure that's not it, I believe it's that the puppet run blocks LDAP from working
[21:23:12] no, I logged in normally
[21:23:46] interesting! I couldn't get it to work before
[21:23:57] meaning I am now in that instance via ssh using my normal mortal key and username
[21:24:00] and that includes manually specifying my cloud key with -i
[21:25:58] inflatador: according to LDAP, your expected ssh key has the public fingerprint of "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGbKzQhw7ByKT4olc+tpDF5cKWaUBgUyrrFICrRoS6IR bking@wikimedia.org cloud"
[21:26:11] so I manually added my exact pubkey to to the root file in /etc/ssh/userkeys or whatever it is
[21:26:17] and 'ssh -i ~/.ssh/bking_other_ed25519 -l root deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud' works
[21:26:39] but 'ssh -i ~/.ssh/bking_other_ed25519 deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud' does not, do I need to use 'bking@wikimedia.org' as the user or something?
[21:26:55] just 'bking'
[21:27:35] let me double check that pub key
[21:30:00] key looks identical to me, here it is from my laptop: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGbKzQhw7ByKT4olc+tpDF5cKWaUBgUyrrFICrRoS6IR bking@wikimedia.org cloud
[21:30:59] * bd808 sees a new failed ssh session as root
[21:33:03] probably me typoing my config
[21:34:07] bd808: it's exactly what I said in https://phabricator.wikimedia.org/T299797#7642181, you can log in because you're a member of the admin project
[21:34:41] have a look at /etc/security/access.conf, the first puppet run is supposed to fix that
[21:34:52] taavi: hmmm... so puppet has never properly run here at all then?
[21:35:19] correct
[21:35:37] don't spend too much time on this one box, if I can gain access without LDAP via the hiera stuff bd808 mentioned, that's good enough
[21:36:06] it won't work either if the first puppet run is failing
[21:37:19] and the same thing applies to adding your keys to labs/private unless we would rebuild the base images and those VMs would be recreated after that
[21:37:27] ok. i'm catching up. There is prefix puppet stuff attempting to apply role::elasticsearch::beta and that is failing for misconfiguration related to the tlsproxy stuff I guess
[21:37:34] Kind of a catch-22, I need access to fix our puppet ;) . But I can get in via cloud-init
[21:38:29] so I can just append my key into that root file in the short term
[21:38:45] Open to suggestions if y'all have a better way though
[21:38:48] it won't wont work either as I just said?
[21:39:00] inflatador: you could also work on the puppet by making an instance that _does not_ have a name matching existing prefix puppet rules and then manually apply the role after initial puppet runs
[21:39:01] cloud-init works
[21:39:54] yeah, if the cloud-init hack is how you appended to /etc/ssh/userkeys/root then that should work
[21:40:21] ugly, but so is prefix puppet matching when trying to work on brand new OS versions :)
[21:40:34] yeah, especially when you know as little about puppet as I do ;)
[21:40:36] (or move the role definition to individual instances from the prefix, and only apply the role to new instances after the first puppet run is complete)
[21:41:47] Thanks guys, feeling much better now!
[21:42:03] "we" (meaning a.ndrewbogott really) went far into a rabbit hole in the past trying to figure out how to separate the first puppet run from per-project config, but that ultimately did not work.
[21:44:41] bd808: now I'm wondering if I could patch the ENC api to not return anything for the first time an api request comes in for an instance.. it would be ugly, but it could work
[21:46:03] taavi: hmmm... maybe. Stateful tracking in the ENC sounds fragile, but if there is a real signal that it could switch on that might work.
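A rough sketch of the idea taavi floats at 21:44, purely hypothetical and not the real ENC API: return an empty classification the first time an instance asks, so the initial puppet run only applies base config (ssh keys, access.conf), and only hand out the prefix/project roles afterwards. The invented seen_instances set below stands in for the "real signal" bd808 asks about, which is exactly the fragile part this glosses over.

    # Hypothetical sketch only: not the actual ENC API or its storage.
    # It assumes some reliable record of "first puppet run already served"
    # exists; here that is just an in-memory set, which would not survive
    # a restart and is the fragility discussed above.
    seen_instances = set()

    def classify(fqdn, project_roles):
        """Return a puppet ENC classification (classes/parameters) for fqdn.

        On the very first request for an instance, return an empty
        classification so the initial run applies only base config;
        afterwards return the configured roles as usual.
        """
        if fqdn not in seen_instances:
            seen_instances.add(fqdn)
            return {"classes": [], "parameters": {}}
        return {"classes": project_roles.get(fqdn, []), "parameters": {}}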