[09:16:58] godog: there's a change from you pending to be merged
[09:20:28] marostegui: doh! sorry
[09:20:49] {{done}}
[09:21:06] <3
[10:07:18] jbond: did you perhaps forget to write "yes" on the puppet-merge? :) the process has been locked for a while
[10:09:51] marostegui: done, sorry
[10:09:59] thank you! merging my change then
[15:15:02] hey folks
[15:15:13] I am reimaging an-worker1132 and this step is taking ages
[15:15:13] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title an-worker1132 not found yet
[15:15:24] has it happened before?
[15:15:46] IIUC the function calls puppetdb's API and gets the above answer
[15:32:12] in Puppet, is there a way to share hiera settings between roles? I have a change proposal adding ci::manager and ci::worker roles and I can't find a way to share hiera settings between them.
[15:32:17] I have tried `hieradata/role/common/ci.yaml` and `hieradata/role/common/ci/common.yaml` but those do not seem to be part of the lookup hierarchy.
[15:32:28] (the change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/907886 )
[15:34:17] elukey: is this a host that you previously tried to reimage?
[15:34:31] IIRC I got that once and I passed --new, but I am not 100% sure :)
[15:35:04] sukhe: o/ it was a host down for days, I passed --new since it wasn't in puppetdb anymore
[15:35:17] ok, sorry then :)
[15:35:25] hashar: you probably need to share it via global config
[15:35:34] sukhe: no no, thanks for the suggestion! anything is useful
[15:35:41] I am checking the cookbook's code atm
[15:36:16] ah ok, so I manually ran puppet agent --noop etc. on the node via install_console, and that did the trick
[15:36:40] interesting!
[15:36:54] elukey: does it give any indication why that cookbook run failed?
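On hashar's hiera question (15:32): a common way to share settings between two roles is to bind the keys to a shared profile that both roles include, and define them at a lookup level common to all nodes. A minimal sketch, assuming a hierarchy with a global common level; the profile name, path and keys here are hypothetical, not from the actual change:

```yaml
# hieradata/common/profile/ci.yaml -- hypothetical path and keys
# Both role::ci::manager and role::ci::worker include profile::ci,
# so these keys resolve for both roles from the common lookup level
# instead of needing a per-role hiera file.
profile::ci::manager_host: 'ci1001.example.wmnet'
profile::ci::work_dir: '/srv/ci'
```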
[15:36:59] there is nothing in the logs
[15:37:00] (basically what _populate_puppetdb does)
[15:37:12] jbond: lemme recheck, but I didn't see anything
[15:37:36] so I see:
[15:37:37] Signed new Puppet certificate ----- OUTPUT of 'puppet agent -t --noop &> /dev/null' -----
[15:37:41] ================ 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet agent -t --noop &> /dev/null'.
[15:37:45] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands. Run Puppet in NOOP mode to populate exported resources in PuppetDB
[15:37:49] [1/50, retrying in 3.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title an-worker1132 not found yet
[15:37:53] [..]
[15:37:57] strange
[15:38:00] yeah
[15:38:41] what's up?
[15:38:46] hashar: I'll comment on the change
[15:38:55] volans: the noop puppet run failed to populate puppetdb
[15:39:00] running it manually fixed things
[15:39:01] great, thanks jbond :)
[15:39:11] the logs all look like everything worked ok
[15:39:17] ah interesting, because I was about to say that most likely puppet was broken
[15:39:27] as we have to consider any exit code as successful from noop
[15:39:39] and we send the output to /dev/null to avoid spamming the console
[15:39:58] ahh ok, so the puppet run could have failed, possibly even due to some transient issue
[15:40:37] could make sense; the reuse recipe for the node didn't take into account the 4TB disks, so I used a little script via install_console to fix them
[15:40:49] (basically to populate fstab)
[15:41:07] and they are referenced in the code, maybe this is the culprit
[15:41:34] yes, that could be it; syslog would probably have the puppet error
[15:41:48] but at this point an error msg with "step failed, do you want to retry the puppet run?" could be great
[15:41:50] actually maybe not
[15:41:51] if possible
[15:42:40] https://puppet-compiler.wmflabs.org/output/907912/40594/ is reporting a NOOP but I'm getting a warning on both the production and change catalogs saying Warning: Failed to compile catalog for node cp5020.eqsin.wmnet: source sequence is illegal/malformed utf-8
[15:49:53] vgutierrez: seems to be hitting a bug in pcc, I'll take a look
[15:50:07] jbond: cheers <3
[15:50:20] and sorry for breaking pcc
[15:51:23] np
[16:02:38] jbond: to answer your previous question, if noop fails completely to compile or similar, it doesn't end up in puppetboard
[16:03:55] we could probably add some heuristic to detect those kinds of failures and retry without the /dev/null...
[16:04:56] ack
[16:12:32] vgutierrez: for that patch would you expect a diff?
[16:12:44] jbond: yep, unless hiera is messing with me big time
[16:12:52] The issue being hit is T238053
[16:12:53] T238053: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053
[16:13:04] in theory it's just a json error and we retry with pson.
[16:13:24] we see that the catalogs both compile, as we have them in the report
[16:13:54] oh wait.. backends are fetched via etcd, so from that PoV it could be a NOOP
[16:14:20] we could try merging this and see: https://gerrit.wikimedia.org/r/c/labs/private/+/907918
[16:14:31] or try running pcc, which will produce a diff
[16:14:45] it looks like it's working correctly but the warning message is not the best
[16:48:34] vgutierrez: I have a feeling this is a noop. However, I updated the warning messages to try and make it a bit clearer that it really is a warning
[16:48:37] https://puppet-compiler.wmflabs.org/output/907912/40596/cp3052.esams.wmnet/change.cp3052.esams.wmnet.err
[16:49:03] jbond: cheers
[16:49:11] NOOP in PCC for sure
[16:49:31] ok cool, then I think you can safely ignore the warning message
[16:52:52] thanks
[16:55:04] np
[16:56:57] jbond: got a sec to discuss the netbox a/a discovery thing? https://phabricator.wikimedia.org/T330084
[16:58:15] anyways, it *seems* like that error in the ticket would've only happened if the puppet agent hadn't run (for the related change) on all the DNS servers before the authdns-update? But even then, I'm not 100% sure.
[16:58:25] either way, I think the important details that help are:
[16:58:46] 1) The namespaces for geoip and metafo (a/a vs a/p) are independent. You can have the same name existing in both places at the same time.
[16:59:24] 2) It's probably simpler (and would work around anything I missed above) to add-then-remove, instead of doing it all in one go.
[17:00:02] by that I mean: (1) puppet change to add the new variant (without removing the old) -> (2) DNS change to switch the record to point at the new one -> (3) puppet change to remove the now-unused old one.
[17:00:58] maybe the above is even a little simplistic, due to the DNS CI "mock" stuff.
[17:01:39] so the sequence is really more like 5 commits total.
[17:02:34] (1) puppet change to add new a/a service (2) DNS change to add matching mock_etc entry (3) DNS change to switch the record for lookups (4) DNS change to remove the old mock_etc entry (5) Puppet change to remove the old a/p service
[17:04:31] [and the puppet change from step 1 needs to be agent-applied on all authdns boxes before (2)]
[17:11:20] bblack: ack, thanks for the info. are you happy for me to copy-paste this into the task for posterity?
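Returning to the _populate_puppetdb step from earlier (15:15/15:37): what the cookbook is retrying is essentially a poll of PuppetDB for the host's exported Nagios_host resource. A rough sketch in Python; the endpoint shape follows PuppetDB's public v4 query API, but the PuppetDB hostname and the retry parameters are illustrative assumptions, not the cookbook's actual code:

```python
import json
import time
import urllib.request


def nagios_host_url(puppetdb: str, fqdn: str) -> str:
    """Build a PuppetDB v4 resources query for the host's exported Nagios_host."""
    return f"https://{puppetdb}/pdb/query/v4/resources/Nagios_host/{fqdn}"


def fetch_resource(url: str):
    """One poll iteration: PuppetDB returns a JSON list, empty means 'not found yet'."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


def poll(fetch, attempts: int = 50, delay: float = 3.0):
    """Retry fetch() until it returns a non-empty result, mimicking the
    '[1/50, retrying in 3.00s]' behaviour seen in the cookbook output."""
    for _ in range(attempts):
        result = fetch()
        if result:
            return result
        time.sleep(delay)
    raise RuntimeError(f"resource not found after {attempts} attempts")
```

If the noop puppet run silently failed (as discussed at 16:02), the query keeps returning an empty list until the resource is actually exported, which matches the "not found yet" retries in the log above.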
[17:11:54] I'll chat with XioN.oX, I think there were some other issues with netbox A/A, but if not I'll give it a try tomorrow
[17:19:19] bblack: added to https://phabricator.wikimedia.org/T330084#8772353 thanks
[17:19:58] jbond: np!
[17:35:43] Did cloud services hosts ever get pulled down with the wmf-update-known-hosts-production script? I don't remember adding them manually before?
[17:36:51] define "cloud services hosts"
[17:37:03] are you talking about cloud* hardware or cloud vps vms?
[17:41:57] jhathaway: no they don't; I think the ssh config from the wmf-sre-laptop package sets wmcs hosts to StrictHostKeyChecking ask
[17:42:24] taavi: fyi, wmf-update-known-hosts-production pulls in all known_hosts fingerprints from production
[17:42:40] yes I am aware :P
[17:42:57] jhathaway: I think it could be beneficial to add the cloud bastion hosts, but I'm not sure it's worth adding the fingerprints of every wmcs host
[17:43:13] ah! that is probably how I got them, thanks jbond. taavi: I was specifically thinking about the bastion hosts, restricted.bastion.wmcloud.org
[17:44:15] yeah, it might be worth shipping those with wmf-laptop
[18:34:24] took a stab at implementing ssh ca support at https://gerrit.wikimedia.org/r/c/operations/puppet/+/907940/ ; that would mean we would not need to include keys for individual hosts and could instead include just the ca key
[18:43:16] taavi: neat!
[18:44:17] reviews welcome :-P
[19:23:07] taavi: very nice!
[21:40:37] Hello, I get the following error when trying to compile my Puppet changes locally: ERROR: Unable to find facts for host
[21:40:37] Searching for my issue, I found a wiki page that mentions it; the solution is to gather facts manually, but the wiki isn't clear on how to do that. https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Troubleshooting
[21:41:26] Do you know how I can 'collect facts by hand from the corresponding puppetmaster' to my local machine?
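On taavi's SSH CA change (18:34): with host certificates, a client can trust a single CA line in ssh_known_hosts instead of one fingerprint per host. A made-up illustration of the client side; the key material and host pattern are placeholders, not the contents of the actual patch:

```
# /etc/ssh/ssh_known_hosts -- hypothetical example, not the actual patch
# Trust any host certificate signed by this CA for *.wmcloud.org hosts,
# instead of shipping an individual key fingerprint per host.
@cert-authority *.wmcloud.org ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... wmcs-host-ca
```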
[21:41:45] denisse: the way to do it changed from the old script to the new script to the even newer script
[21:41:54] each time it was supposed to become a little easier, heh
[21:41:59] https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes
[21:42:30] so it _should_ now be:
[21:42:31] $ ssh puppetmaster1001.eqiad.wmnet sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080
[21:42:35] $ ssh pcc-db1001.puppet-diffs.eqiad1.wikimedia.cloud sudo systemctl start pcc_facts_processor.service
[21:42:51] Thanks, I saw that, but if I understand correctly that updates the facts on the puppetmaster and not on my local machine, right?
[21:44:00] Oh, sorry. I skipped the word "locally" completely since I have never run that locally. I always use https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/
[21:44:36] it updates the facts on the machines running the compiler in cloud
[21:44:42] with facts from the prod puppetmaster
[21:46:09] No problem, hopefully someone knows how to update the facts manually so the instructions can be added to the wiki. :D
[21:57:37] jbond: ^ can you help?
[21:59:00] denisse: are you running pcc locally?
[21:59:14] jbond: Yes, I'm running it locally. :)
[21:59:38] I can run my changes without issues in eqiad and codfw. In our other DCs I always get the same issue.
[21:59:43] * jbond even I don't do that
[22:00:02] how are you running it locally?
[22:00:29] i.e. what cli are you using?
[22:01:05] or should I say: command, arguments and environment vars
[22:01:55] This is how I'm running it: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Catalog_compiler_local_run_(pcc_utility)
[22:02:34] If it's not the right way to do it and only I use it, maybe we could remove it from the wiki, and the script from the repository, to avoid confusion. :)
[22:02:50] ahh ok cool, that uses a local cli but still runs the jobs on the main pcc cluster
[22:03:02] oh, that's my preferred way to use it :)
[22:03:31] Oh, okay. So it connects to the main PCC, right?
[22:03:33] so using the instructions posted by mutante should fix the issue
[22:04:34] yes, it connects to jenkins and runs the same job that would run if you go to https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/build?delay=0sec
[22:06:12] Sadly the instructions are not working. The pcc_facts_processor.service fails ...
[22:06:47] one sec, I'll take a look
[22:07:09] Thanks. <3
[22:07:38] It looks like a permissions issue to me. Possibly it requires a specific user to do the sync.
[22:08:22] denisse: hmmm.. didn't I add you to that project in the past? checking horizon
[22:08:33] ahh yes, that could be it
[22:09:02] so I remember doing this just recently
[22:09:07] to fix the same issue?
[22:09:20] denisse is in there as "member, reader"
[22:09:31] others are also projectadmins
[22:09:39] but you know how the group names changed too
[22:09:42] in cloud VPS
[22:09:48] there is some hiera setting to make sure everyone in a group is auto-added. I'll try to remember to send a patch to make sure all ops and sre-admins are auto-added to this group
[22:10:10] sudo is set to "any project user"
[22:11:21] hmm.. she is in there, all I can do is "revoke" things
[22:11:24] mutante: if you are in horizon, look at the hiera config for the restricted bastion host in the bastion project
[22:11:27] trying to do it again
[22:11:37] oh.. the bastion project!
[22:11:52] we basically need the same for the puppet-diff project, but at project level (it's at host level for the bastion)
[22:12:49] profile::ldap::client::labs::restricted_to:
[22:12:49] - ops
[22:12:50] - sre-admins
[22:13:35] yes, exactly. if you could add that as a project-wide puppet default hiera config for puppet-diff, it should fix this
[22:14:59] hmmm, yes, I am trying to do that but...
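The fix being discussed, moving the Horizon hiera snippet into the puppet repo as a project-wide default, would contain roughly the data below; the keys are the ones quoted at 22:12, but the file location is a guess at the convention, not the contents of the actual patch:

```yaml
# hypothetical project-level hiera file for the puppet-diff Cloud VPS project
# (the actual path/convention in the repo may differ)
profile::ldap::client::labs::restricted_to:
  - ops
  - sre-admins
```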
[22:15:09] one sec, I'm going to send a patch :)
[22:15:36] I click "apply changes" and it.. doesn't do it
[22:15:50] the hiera config seems unchanged
[22:16:03] perhaps a syntax error?
[22:16:20] either way, it's better to have it in the puppet repo than in horizon, one sec
[22:17:21] yeah, let's add it in the repo. it's better regardless
[22:17:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/907994
[22:18:07] merging
[22:18:39] cool
[22:18:56] let me know how it goes, hopefully this removes one more step from onboarding
[22:20:08] ok, I'm going to check out now, enjoy your day/evening/morning/..
[22:20:46] jbond: thank you, good night, I am taking care of the puppet runs
[22:21:06] great, thanks, and good night :)
[22:21:13] /Stage[main]/Security::Access/File[/etc/security/access.conf.d/99-labs_restrict_to_project]/ensure: removed
[22:21:23] /Stage[main]/Profile::Ldap::Client::Labs/Security::Access::Config[labs-restrict-to-group]/File[/etc/security/access.conf.d/99-labs_restrict_to_group]/ensure: defined content as ...
[22:21:48] "labs" just not going away
[22:22:56] mutante: I don't think anyone outside of sre was in that project, so it should be fine, but I'll double check tomorrow. possibly taavi was in there
[22:24:06] denisse: try it now
[22:24:15] jbond: *nod* great! ok
[22:24:23] I ran puppet on the 3 "worker" instances
[22:27:37] Thanks a lot for your help, trying it now.
[22:32:03] I ran puppet on pcc-db1001 but the daemon keeps failing to update the facts.
[22:32:22] * denisse looking at the issue
[22:33:58] maybe make a pastebin