[06:22:52] jbond: there are puppet changes pending from you
[09:31:09] FYI I've moved the "prometheus job unavailable" alerts from icinga to alertmanager just now, the outstanding alerts will fire again
[09:44:42] Q about decommissioning hosts. I've put in https://gerrit.wikimedia.org/r/c/operations/puppet/+/761283 to remove the hosts from puppet (which includes taking them out of conftool-data), and I've already depooled the target hosts. Are there other steps necessary before running the sre.hosts.decommission cookbook? The "remove from production" wikitech page has a note 'Remove from pybal/LVS (if applicable) - see the sre.hosts.reimage
[09:44:42] cookbook option -c/--conftool and consult the LVS page'
[09:46:16] running puppet-merge looks to have a done the conftool cleanup
[09:48:19] and e.g. 'confctl select dc=codfw,cluster=swift get' no longer lists the old hosts, so I think I'm OK, I'd just like a grownup to confirm I'm not missing something obvious :)
[09:52:28] I think it is ok to proceed, if the hosts are not getting any traffic you are good to go (and puppet is already cleaned up)
[09:58:00] yeah, that's all you need on the conftool side, you can proceed with the decom
[10:00:48] thanks
[10:02:16] masorry about that looks like its been merged now
[10:03:10] marostegui: was ment for you most have missed hitting tab :)
[10:03:22] jbond: yes, I merged it
[10:03:29] gretatthanks
[10:03:33] np!
[10:22:32] Emperor: FYI ms-fe2005 did sent a bunch of cron-spam to root@
[10:23:43] and still sending, various per minute
[10:24:12] 8/minute AFAICT
[10:24:51] I notice I need to move the stats reporter, and probably also restart swift-proxy on those old nodes
[10:26:05] stats reporter> https://gerrit.wikimedia.org/r/c/operations/puppet/+/761293
[10:26:39] [I have updated the swift/howto to add a note that this needs moving also]
[10:28:25] godog: you OK to +1 that CR so I can move the stats reporter to ms-fe2009 please? Then I can try and shut up ms-fe2005
[10:29:38] Emperor: yeah LGTM
[10:30:16] ah this is the cronspam? awesome, I will be happy to see it squelched
[10:30:26] sorry!
[10:30:42] as long as it's being worked on, all good
[10:32:35] puppet run on old and new node, so the crons/timers should have moved over
[10:32:51] \o/
[10:33:46] have also noted the need to move this when decommissioning swift proxies in future...
[10:55:39] I'm mildly confused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/761294 since I was expecting moving prometheus_nodes to role-based hiera to work, looks like it doesn't ? https://puppet-compiler.wmflabs.org/pcc-worker1003/33641/cloudcephmon1003.eqiad.wmnet/index.html
[11:22:16] <_joe_> Emperor: sorry I was still knee deep in spicerack's code, you did everything correctly
[11:40:07] what are you saying, wading through all that mud? ;p
[11:45:28] <_joe_> question_mark: it is painful, but to be fair it's a certain satisfaction when dodgy, mccabe, pep257, pep8, profile-validator, pyflakes, pylint, pyroma, vulture say your code is ok
[11:45:48] <_joe_> (that's just from prospector, of course we have more linters/checkers)
[11:50:46] * volans not taking the bait
[11:51:03] * volans reviewing the code instead... wait for the volint
[11:52:52] the .8 volfloat
[12:01:51] <_joe_> volans: I would prefer if you concentrated on the code structure first, given this is more or less a POC
[12:02:20] <_joe_> I've implemented basically the libraries for one out of all the cookbooks we plan to write
[12:02:22] _joe_: for that I would need first a crash course in k8s api
[12:02:35] <_joe_> sure!
[12:02:55] <_joe_> AMA
[12:03:27] mandatory: https://www.amaroma.it/images-new/assets/loghi/logo-amaroma-home.png
[12:05:17] <_joe_> ahahaha
[12:05:28] <_joe_> noone else will get how appropriat it is
[12:05:31] <_joe_> *e
[12:13:45] _joe_: I think that the general structure seems ok for now, I really like the Spicerack.kubernetes() accessor implementation :-P
[12:14:42] <_joe_> heh yeah it's not like there's much to configure
[12:15:30] <_joe_> oh heh commit fail I guess :D
[12:15:44] <_joe_> it was suppsoed to just return an instance of the Kubernetes class :P
[12:15:56] yeah I know :-P
[12:16:16] btw I hate that gerrit gives me the option to publish either all or none of my comments
[12:16:24] I would like to publish just some of them
[12:16:45] <_joe_> add [volint] in front of the others
[12:16:47] <_joe_> :D
[12:21:23] {done} all minors shouild have a [...]
[12:21:26] prefix
[12:26:11] can I have a look at that code? cc _joe_ volans
[12:26:44] not for review, for to learn from the implementation details
[12:28:09] s/for/but/
[12:29:57] <_joe_> arturo: https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/761297
[12:30:08] excellent
[12:30:13] thanks
[12:30:21] <_joe_> if you see something wrong, please comment!
[12:30:30] can someone explain why I'm seeing "ERROR profile::prometheus::ops not in autoload module layout (autoloader_layout)" on https://gerrit.wikimedia.org/r/c/operations/puppet/+/761313/
[12:31:02] ah.. probably a typo in the hiera key :/
[12:31:29] yeah, some kind on unrelated code causing syntax error
[12:31:34] (probably)
[12:41:47] taavi: i just hit that my self can you try a rebase and see if it fixes it
[12:42:02] jbond: fixed itself on a second PS
[12:42:12] ci should have that check disabled both explicitly and by puppetlabs/spec_helper_rake_tasts so its a wiered one
[12:42:15] ack
[12:42:28] but i did make changes here yesterday so...
[12:56:50] jbond: I'm also getting a "Whoops! It looks like puppet-lint has encountered an error that it doesn't know how to handle." on https://gerrit.wikimedia.org/r/c/operations/puppet/+/761315/ which persists with a recheck
[13:05:26] taavi: ack looking now
[13:31:19] taaivi should be good now
[13:31:25] taavi: even :)
[13:31:33] thx!
[13:32:55] sigh, still scratching my head on why https://gerrit.wikimedia.org/r/c/operations/puppet/+/761294 doesn't seem to apply the role hiera, at least according to pcc https://puppet-compiler.wmflabs.org/pcc-worker1003/33641/cloudcephmon1003.eqiad.wmnet/index.html
[13:34:24] my understanding is that it should work as intended, i.e. the role hiera has precedence over e.g. hieradata/eqiad.yaml
[13:41:58] or likely my understanding is wrong heh
[13:43:25] godog: i just checked and the site layer of hiera takes precendence over the role/site. personally i would say this is not ideal but its been this way for some time https://gerrit.wikimedia.org/r/c/operations/puppet/+/566559/60/modules/puppetmaster/files/production.hiera.yaml
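What jbond describes at 13:43 is hiera's usual first-match lookup over an ordered list of layers: a key set in a site-wide file shadows the same key in the role/site layer whenever the site layer is listed higher in the hierarchy, which would explain why godog's role-hiera change came out as a no-op in PCC. A toy Python illustration of that lookup behaviour, assuming a made-up hierarchy and values (the real layer order is in the linked production.hiera.yaml):

    # Toy illustration of hiera-style "first match wins" lookup.
    # Layer paths, hostnames and values below are invented for illustration;
    # they are not the actual production hierarchy or data.
    from typing import Any, Optional

    # Ordered highest priority first: the site-wide layer sits above the
    # role/site layer, so it shadows it.
    HIERARCHY = [
        ("hieradata/eqiad.yaml", {"prometheus_nodes": ["prometheus1003.example"]}),
        ("hieradata/role/eqiad/ceph/mon.yaml", {"prometheus_nodes": ["prometheus1004.example"]}),
        ("hieradata/common.yaml", {}),
    ]

    def lookup(key: str) -> Optional[Any]:
        """Return the value from the first layer that defines the key."""
        for _name, data in HIERARCHY:
            if key in data:
                return data[key]
        return None

    # The site layer wins even though the role layer also defines the key.
    assert lookup("prometheus_nodes") == ["prometheus1003.example"]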
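For context on the Spicerack.kubernetes() accessor joked about around 12:15 above: the shape under review is just a factory method on the Spicerack entry point that hands a cookbook a ready-made wrapper object. A hypothetical sketch of that pattern only, with invented class and argument names rather than the actual code in the linked change:

    # Hypothetical sketch of the accessor pattern discussed above; the real
    # implementation is in the operations/software/spicerack change linked
    # at 12:29:57. All names here are invented for illustration.
    class Kubernetes:
        """Illustrative thin wrapper around a Kubernetes cluster API."""

        def __init__(self, group: str, cluster: str):
            self.group = group
            self.cluster = cluster

    class Spicerack:
        """Stripped-down stand-in for spicerack's accessor-based entry point."""

        def kubernetes(self, group: str, cluster: str) -> Kubernetes:
            # As _joe_ notes, there is not much to configure: the accessor
            # just returns an instance of the Kubernetes class.
            return Kubernetes(group, cluster)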
[13:45:12] jbond: thank you for double checking, yeah I didn't expect it either heh
[13:45:35] that's a pandora's box I don't want to open atm though
[13:55:29] yes exactly :)
[13:56:01] the public vs private prefrence is the wrong way round imo as well but thats likley a bigger can of worms
[13:56:39] * godog sighs in hiera
[13:57:16] godog: surely you mean sighs::in:hiera :P
[13:57:30] haha! well played sir
[14:17:07] godog: fyi: https://phabricator.wikimedia.org/T301349 has changes allready attached. however reating the CR's is the easy bit here and testing is the tricky bit
[14:50:54] jbond: oohh very nice, thank you for kickstarting that
[14:53:51] fyi pcc for the role change came up noop so im tempted to deploy if no one shoots https://puppet-compiler.wmflabs.org/pcc-worker1002/33656/
[14:54:14] shouts (or shoots i gusse :))
[14:54:34] lol
[14:55:05] the pcc result is encouraging for sure, can't be certain one way or another (and have to jump in a meeting in five)
[20:52:32] taavi or anyone else, small PR for adding myself and ryankemper to the deployment-prep authorized keys: https://gerrit.wikimedia.org/r/c/labs/private/+/761465/1/modules/passwords/templates/root-authorized-keys.erb TIA~
[20:56:05] inflatador: yeah... that does not do what you wanted to do. See my comment on where you really need to add these keys
[20:57:03] :eyes
[20:59:11] inflatador: sort of related: what do you mean by "login locally" on https://phabricator.wikimedia.org/T299797#7691081?
[21:11:06] taavi bd808 I'm planning on meeting with ryankemper in about 20m (2130 UTC) . If y'all have time to join, would love your help in getting unblocked on this
[21:14:29] inflatador: what specifically do you need help with? I'm not really understanding the problem statement in T301408 or finding more details in T299797
[21:14:30] T301408: Enable search team SRE access to deployment-prep VMs - https://phabricator.wikimedia.org/T301408
[21:14:30] T299797: Deploy new elastic cluster nodes on deployment-prep - https://phabricator.wikimedia.org/T299797
[21:16:38] bd808 I can't login. Will try to add the keys in as you suggested in your comment and get back to you
[21:17:27] inflatador: which host are you failing to login to? deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud?
[21:18:49] The task mentions deployment-elastic00 but there does not seem to be any instance with that name in the project
[21:19:18] yeah, any new VMs running bullseye. I will update https://phabricator.wikimedia.org/T299797 to make it more clear, but basically we need to test our elasticsearch deploys on bullseye
[21:21:52] inflatador: the auth.log on deployment-elastic11 is full of things like "ailed publickey for bking from 172.16.5.8 port 43492 ssh2: ED25519 SHA256:iZFGqiIXaMe6xnjlXEBt2KRV3beFC4TqmTxCnXJKIZU"
[21:22:29] are you maybe not sending the expected key? Like you prod key instead of your WMCS key?
[21:22:50] bd808 99% sure that's not it, I believe it's that the puppet run blocks LDAP from working
[21:23:12] no, I logged in normally
[21:23:46] interesting! I couldn't get it to work before
[21:23:57] meaning I am now in that instance via ssh using my normal mortal key and username
[21:24:00] and that includes manually specifying my cloud key with -i
[21:25:58] inflatador: according to LDAP, your expected ssh key has the public fingerprint of "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGbKzQhw7ByKT4olc+tpDF5cKWaUBgUyrrFICrRoS6IR bking@wikimedia.org cloud"
[21:26:11] so I manually added my exact pubkey to to the root file in /etc/ssh/userkeys or whatever it is
[21:26:17] and 'ssh -i ~/.ssh/bking_other_ed25519 -l root deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud' works
[21:26:39] but 'ssh -i ~/.ssh/bking_other_ed25519 deployment-elastic11.deployment-prep.eqiad1.wikimedia.cloud' does not, do I need to use 'bking@wikimedia.org' as the user or something?
[21:26:55] just 'bking'
[21:27:35] let me double check that pub key
[21:30:00] key looks identical to me, here it is from my laptop: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGbKzQhw7ByKT4olc+tpDF5cKWaUBgUyrrFICrRoS6IR bking@wikimedia.org cloud
[21:30:59] * bd808 sees a new failed ssh session as root
[21:33:03] probably me typoing my config
[21:34:07] bd808: it's exactly what I said in https://phabricator.wikimedia.org/T299797#7642181, you can log in because you're a member of the admin project
[21:34:41] have a look at /etc/security/access.conf, the first puppet run is supposed to fix that
[21:34:52] taavi: hmmm... so puppet has never properly run here at all then?
[21:35:19] correct
[21:35:37] don't spend too much time on this one box, if I can gain access without LDAP via the hiera stuff bd808 mentioned, that's good enough
[21:36:06] it won't work either if the first puppet run is failing
[21:37:19] and the same thing applies to adding your keys to labs/private unless we would rebuild the base images and those VMs would be recreated after that
[21:37:27] ok. i'm catching up. There is prefix puppet stuff attempting to apply role::elasticsearch::beta and that is failing for misconfiguration related to the tlsproxy stuff I guess
[21:37:34] Kind of a catch-22, I need access to fix our puppet ;) . But I can get in via cloud-init
[21:38:29] so I can just append my key into that root file in the short term
[21:38:45] Open to suggestions if y'all have a better way though
[21:38:48] it won't wont work either as I just said?
[21:39:00] inflatador: you could also work on the puppet by making an instance that _does not_ have a name matching existing prefix puppet rules and then manually apply the role after initial puppet runs
[21:39:01] cloud-init works
[21:39:54] yeah, if the cloud-init hack is how you appended to /etc/ssh/userkeys/root then that should work
[21:40:21] ugly, but so is prefix puppet matching when trying to work on brand new OS versions :)
[21:40:34] yeah, especially when you know as little about puppet as I do ;)
[21:40:36] (or move the role definition to individual instances from the prefix, and only apply the role to new instances after the first puppet run is complete)
[21:41:47] Thanks guys, feeling much better now!
[21:42:03] "we" (meaning a.ndrewbogott really) went far into a rabbit hole in the past trying to figure out how to separate the first puppet run from per-project config, but that ultimately did not work.
[21:44:41] bd808: now I'm wondering if I could patch the ENC api to not return anything for the first time an api request comes in for an instance.. it would be ugly, but it could work
[21:46:03] taavi: hmmm... maybe. Stateful tracking in the ENC sounds fragile, but if there is a real signal that it could switch on that might work.
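A rough sketch of the idea taavi floats at 21:44, purely hypothetical and not the real ENC API: return an empty classification the first time an instance asks, so the initial puppet run only applies base config (ssh keys, access.conf), and only hand out the prefix/project roles afterwards. The invented seen_instances set below stands in for the "real signal" bd808 asks about, which is exactly the fragile part this glosses over.

    # Hypothetical sketch only: not the actual ENC API or its storage.
    # It assumes some reliable record of "first puppet run already served"
    # exists; here that is just an in-memory set, which would not survive
    # a restart and is the fragility discussed above.
    seen_instances = set()

    def classify(fqdn, project_roles):
        """Return a puppet ENC classification (classes/parameters) for fqdn.

        On the very first request for an instance, return an empty
        classification so the initial run applies only base config;
        afterwards return the configured roles as usual.
        """
        if fqdn not in seen_instances:
            seen_instances.add(fqdn)
            return {"classes": [], "parameters": {}}
        return {"classes": project_roles.get(fqdn, []), "parameters": {}}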