[09:16:58] godog: there's a change from you pending to be merged
[09:20:28] marostegui: doh! sorry
[09:20:49] {{done}}
[09:21:06] <3
[10:07:18] jbond: did you perhaps forget to write "yes" on the puppet-merge? :) the process has been locked for a while
[10:09:51] marostegui: done, sorry
[10:09:59] thank you! merging my change then
[15:15:02] hey folks
[15:15:13] I am reimaging an-worker1132 and this step is taking ages
[15:15:13] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title an-worker1132 not found yet
[15:15:24] has it happened before?
[15:15:46] IIUC the function calls puppetdb's API and gets the above answer
[15:32:12] in Puppet, is there a way to share hiera settings between roles? I have a change proposal adding ci::manager and ci::worker roles and I can't find a way to share hiera settings between them.
[15:32:17] I have tried `hieradata/role/common/ci.yaml` and `hieradata/role/common/ci/common.yaml` but those do not seem to be part of the lookup hierarchy.
[15:32:28] (the change is https://gerrit.wikimedia.org/r/c/operations/puppet/+/907886 )
[15:34:17] elukey: is this a host that you previously tried to reimage?
[15:34:31] IIRC I got that once and I passed --new, but I am not 100% sure :)
[15:35:04] sukhe: o/ it was a host down for days, I passed --new since it wasn't in puppetdb anymore
[15:35:17] ok, sorry then :)
[15:35:25] hashar: you probably need to share it via global config
[15:35:34] sukhe: no no, thanks for the suggestion! anything is useful
[15:35:41] I am checking the cookbook's code atm
[15:36:16] ah ok, so I manually ran puppet agent --noop etc. on the node via install_console, and that did the trick
[15:36:40] interesting!
[15:36:54] elukey: does it give any indication why that cookbook run failed?
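On hashar's hiera question (15:32): a common way to share settings between two roles is to bind the keys to a shared profile that both roles include, and define them at a lookup level common to all nodes. A minimal sketch, assuming a hierarchy with a global common level; the profile name, path and keys here are hypothetical, not from the actual change:

```yaml
# hieradata/common/profile/ci.yaml -- hypothetical path and keys
# Both role::ci::manager and role::ci::worker include profile::ci,
# so these keys resolve for both roles from the common lookup level
# instead of needing a per-role hiera file.
profile::ci::manager_host: 'ci1001.example.wmnet'
profile::ci::work_dir: '/srv/ci'
```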
[15:36:59] there is nothing in the logs
[15:37:00] (basically what _populate_puppetdb does)
[15:37:12] jbond: lemme recheck, but I didn't see anything
[15:37:36] so I see:
[15:37:37] Signed new Puppet certificate ----- OUTPUT of 'puppet agent -t --noop &> /dev/null' -----
[15:37:41] ================ 100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'puppet agent -t --noop &> /dev/null'.
[15:37:45] 100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands. Run Puppet in NOOP mode to populate exported resources in PuppetDB
[15:37:49] [1/50, retrying in 3.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title an-worker1132 not found yet
[15:37:53] [..]
[15:37:57] strange
[15:38:00] yeah
[15:38:41] what's up?
[15:38:46] hashar: I'll comment on the change
[15:38:55] volans: the noop puppet run failed to populate puppetdb
[15:39:00] running it manually fixed things
[15:39:01] great, thanks jbond :)
[15:39:11] the logs all look like everything worked ok
[15:39:17] ah interesting, because I was about to say that most likely puppet was broken
[15:39:27] as we have to consider any exit code as successful from noop
[15:39:39] and we send the output to /dev/null to avoid spamming the console
[15:39:58] ahh ok, so the puppet run could have failed, possibly even due to some transient issue
[15:40:37] could make sense; the reuse recipe for the node didn't take into account the 4TB disks, so I used a little script via install_console to fix them
[15:40:49] (basically to populate fstab)
[15:41:07] and they are referenced in the code, maybe this is the culprit
[15:41:34] yes, that could be it; syslog would probably have the puppet error
[15:41:48] but at this point an error msg with "step failed, do you want to retry the puppet run?" could be great
[15:41:50] actually maybe not
[15:41:51] if possible
[15:42:40] https://puppet-compiler.wmflabs.org/output/907912/40594/ is reporting a NOOP but I'm getting a warning on both the production and change catalogs saying Warning: Failed to compile catalog for node cp5020.eqsin.wmnet: source sequence is illegal/malformed utf-8
[15:49:53] vgutierrez: seems to be hitting a bug in pcc, I'll take a look
[15:50:07] jbond: cheers <3
[15:50:20] and sorry for breaking pcc
[15:51:23] np
[16:02:38] jbond: to answer your previous question, if noop fails completely to compile or similar, it doesn't end up in puppetboard
[16:03:55] we could probably add some heuristic to detect those kinds of failures and retry without the /dev/null...
[16:04:56] ack
[16:12:32] vgutierrez: for that patch would you expect a diff?
[16:12:44] jbond: yep, unless hiera is messing with me big time
[16:12:52] The issue being hit is T238053
[16:12:53] T238053: puppet-compiler fails to compile production catalog for restbase2014 - https://phabricator.wikimedia.org/T238053
[16:13:04] in theory it's just a json error and we retry with pson.
[16:13:24] we see that the catalogs both compile, as we have them in the report
[16:13:54] oh wait.. backends are fetched via etcd, so from that PoV it could be a NOOP
[16:14:20] we could try merging this and see: https://gerrit.wikimedia.org/r/c/labs/private/+/907918
[16:14:31] or try running pcc, which will produce a diff
[16:14:45] it looks like it's working correctly but the warning message is not the best
[16:48:34] vgutierrez: I have a feeling this is a noop. However, I updated the warning messages to try and make it a bit clearer that it really is a warning
[16:48:37] https://puppet-compiler.wmflabs.org/output/907912/40596/cp3052.esams.wmnet/change.cp3052.esams.wmnet.err
[16:49:03] jbond: cheers
[16:49:11] NOOP in PCC for sure
[16:49:31] ok cool, then I think you can safely ignore the warning message
[16:52:52] thanks
[16:55:04] np
[16:56:57] jbond: got a sec to discuss the netbox a/a discovery thing? https://phabricator.wikimedia.org/T330084
[16:58:15] anyways, it *seems* like that error in the ticket would've only happened if the puppet agent hadn't run (for the related change) on all the DNS servers before the authdns-update? But even then, I'm not 100% sure.
[16:58:25] either way, I think the important details that help are:
[16:58:46] 1) The namespaces for geoip and metafo (a/a vs a/p) are independent. You can have the same name existing in both places at the same time.
[16:59:24] 2) It's probably simpler (and would work around anything I missed above) to add-then-remove, instead of doing it all in one go.
[17:00:02] by that I mean: (1) puppet change to add the new variant (without removing the old) -> (2) DNS change to switch the record to point at the new one -> (3) puppet change to remove the now-unused old one.
[17:00:58] maybe the above is even a little simplistic, due to the DNS CI "mock" stuff.
[17:01:39] so the sequence is really more like 5 commits total.
[17:02:34] (1) puppet change to add new a/a service (2) DNS change to add matching mock_etc entry (3) DNS change to switch the record for lookups (4) DNS change to remove the old mock_etc entry (5) Puppet change to remove the old a/p service
[17:04:31] [and the puppet change from step 1 needs to be agent-applied on all authdns boxes before (2)]
[17:11:20] bblack: ack, thanks for the info. are you happy for me to copy-paste this into the task for posterity?
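Returning to the _populate_puppetdb step from earlier (15:15/15:37): what the cookbook is retrying is essentially a poll of PuppetDB for the host's exported Nagios_host resource. A rough sketch in Python; the endpoint shape follows PuppetDB's public v4 query API, but the PuppetDB hostname and the retry parameters are illustrative assumptions, not the cookbook's actual code:

```python
import json
import time
import urllib.request


def nagios_host_url(puppetdb: str, fqdn: str) -> str:
    """Build a PuppetDB v4 resources query for the host's exported Nagios_host."""
    return f"https://{puppetdb}/pdb/query/v4/resources/Nagios_host/{fqdn}"


def fetch_resource(url: str):
    """One poll iteration: PuppetDB returns a JSON list, empty means 'not found yet'."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())


def poll(fetch, attempts: int = 50, delay: float = 3.0):
    """Retry fetch() until it returns a non-empty result, mimicking the
    '[1/50, retrying in 3.00s]' behaviour seen in the cookbook output."""
    for _ in range(attempts):
        result = fetch()
        if result:
            return result
        time.sleep(delay)
    raise RuntimeError(f"resource not found after {attempts} attempts")
```

If the noop puppet run silently failed (as discussed at 16:02), the query keeps returning an empty list until the resource is actually exported, which matches the "not found yet" retries in the log above.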
[17:11:54] I'll chat with XioN.oX, I think there were some other issues with netbox A/A, but if not I'll give it a try tomorrow
[17:19:19] bblack: added to https://phabricator.wikimedia.org/T330084#8772353 thanks
[17:19:58] jbond: np!
[17:35:43] Did cloud services hosts ever get pulled down with the wmf-update-known-hosts-production script? I don't remember adding them manually before?
[17:36:51] define "cloud services hosts"
[17:37:03] are you talking about cloud* hardware or cloud vps vms?
[17:41:57] jhathaway: no they don't; I think the ssh config from the wmf-sre-laptop package sets wmcs hosts to StrictHostKeyChecking ask
[17:42:24] taavi: fyi, wmf-update-known-hosts-production pulls in all known_hosts fingerprints from production
[17:42:40] yes I am aware :P
[17:42:57] jhathaway: I think it could be beneficial to add the cloud bastion hosts, but I'm not sure it's worth adding the fingerprints of every wmcs host
[17:43:13] ah! that is probably how I got them, thanks jbond. taavi: I was specifically thinking about the bastion hosts, restricted.bastion.wmcloud.org
[17:44:15] yeah, it might be worth shipping those with wmf-laptop
[18:34:24] took a stab at implementing ssh ca support at https://gerrit.wikimedia.org/r/c/operations/puppet/+/907940/ ; that would mean we would not need to include keys for individual hosts and could instead include just the ca key
[18:43:16] taavi: neat!
[18:44:17] reviews welcome :-P
[19:23:07] taavi: very nice!
[21:40:37] Hello, I get the following error when trying to compile my Puppet changes locally: ERROR: Unable to find facts for host
[21:40:37] Searching for my issue, I found a wiki page that mentions it; the solution is to gather facts manually, but the wiki isn't clear on how to do that. https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Troubleshooting
[21:41:26] Do you know how I can 'collect facts by hand from the corresponding puppetmaster' to my local machine?
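On taavi's SSH CA change (18:34): with host certificates, a client can trust a single CA line in ssh_known_hosts instead of one fingerprint per host. A made-up illustration of the client side; the key material and host pattern are placeholders, not the contents of the actual patch:

```
# /etc/ssh/ssh_known_hosts -- hypothetical example, not the actual patch
# Trust any host certificate signed by this CA for *.wmcloud.org hosts,
# instead of shipping an individual key fingerprint per host.
@cert-authority *.wmcloud.org ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA... wmcs-host-ca
```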
[21:41:45] denisse: the way to do it changed from the old script to the new script to the even newer script
[21:41:54] each time it was supposed to become a little easier, heh
[21:41:59] https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Updating_nodes
[21:42:30] so it _should_ now be:
[21:42:31] $ ssh puppetmaster1001.eqiad.wmnet sudo /usr/local/sbin/puppet-facts-upload --proxy http://webproxy.eqiad.wmnet:8080
[21:42:35] $ ssh pcc-db1001.puppet-diffs.eqiad1.wikimedia.cloud sudo systemctl start pcc_facts_processor.service
[21:42:51] Thanks, I saw that, but if I understand correctly that updates the facts on the puppetmaster and not on my local machine, right?
[21:44:00] Oh, sorry. I skipped the word "locally" completely since I have never run that locally. I always use https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/
[21:44:36] it updates the facts on the machines running the compiler in cloud
[21:44:42] with facts from the prod puppetmaster
[21:46:09] No problem, hopefully someone knows how to update the facts manually so the instructions can be added to the wiki. :D
[21:57:37] jbond: ^ can you help?
[21:59:00] denisse: are you running pcc locally?
[21:59:14] jbond: Yes, I'm running it locally. :)
[21:59:38] I can run my changes without issues in eqiad and codfw. In our other DCs I always get the same issue.
[21:59:43] * jbond even I don't do that
[22:00:02] how are you running it locally?
[22:00:29] i.e. what cli are you using?
[22:01:05] or should I say: command, arguments and environment vars
[22:01:55] This is how I'm running it: https://wikitech.wikimedia.org/wiki/Help:Puppet-compiler#Catalog_compiler_local_run_(pcc_utility)
[22:02:34] If it's not the right way to do it and only I use it, maybe we could remove it from the wiki, and the script from the repository, to avoid confusion. :)
[22:02:50] ahh ok cool, that uses a local cli but still runs the jobs on the main pcc cluster
[22:03:02] oh, that's my preferred way to use it :)
[22:03:31] Oh, okay. So it connects to the main PCC, right?
[22:03:33] so using the instructions posted by mutante should fix the issue
[22:04:34] yes, it connects to jenkins and runs the same job that would run if you go to https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/build?delay=0sec
[22:06:12] Sadly the instructions are not working. The pcc_facts_processor.service fails ...
[22:06:47] one sec, I'll take a look
[22:07:09] Thanks. <3
[22:07:38] It looks like a permissions issue to me. Possibly it requires a specific user to do the sync.
[22:08:22] denisse: hmmm.. didn't I add you to that project in the past? checking horizon
[22:08:33] ahh yes, that could be it
[22:09:02] so I remember doing this just recently
[22:09:07] to fix the same issue?
[22:09:20] denisse is in there as "member, reader"
[22:09:31] others are also projectadmins
[22:09:39] but you know how the group names changed too
[22:09:42] in cloud VPS
[22:09:48] there is some hiera setting to make sure everyone in a group is auto-added. I'll try to remember to send a patch to make sure all ops and sre-admins are auto-added to this group
[22:10:10] sudo is set to "any project user"
[22:11:21] hmm.. she is in there, all I can do is "revoke" things
[22:11:24] mutante: if you are in horizon, look at the hiera config for the restricted bastion host in the bastion project
[22:11:27] trying to do it again
[22:11:37] oh.. the bastion project!
[22:11:52] we basically need the same for the puppet-diff project, but at project level (it's at host level for the bastion)
[22:12:49] profile::ldap::client::labs::restricted_to:
[22:12:49] - ops
[22:12:50] - sre-admins
[22:13:35] yes, exactly. if you could add that as a project-wide puppet default hiera config for puppet-diff, it should fix this
[22:14:59] hmmm, yes, I am trying to do that but...
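The fix being discussed, moving the Horizon hiera snippet into the puppet repo as a project-wide default, would contain roughly the data below; the keys are the ones quoted at 22:12, but the file location is a guess at the convention, not the contents of the actual patch:

```yaml
# hypothetical project-level hiera file for the puppet-diff Cloud VPS project
# (the actual path/convention in the repo may differ)
profile::ldap::client::labs::restricted_to:
  - ops
  - sre-admins
```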
[22:15:09] one sec, I'm going to send a patch :)
[22:15:36] I click "apply changes" and it.. doesn't do it
[22:15:50] the hiera config seems unchanged
[22:16:03] perhaps a syntax error?
[22:16:20] either way, it's better to have it in the puppet repo than in horizon, one sec
[22:17:21] yeah, let's add it in the repo. it's better regardless
[22:17:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/907994
[22:18:07] merging
[22:18:39] cool
[22:18:56] let me know how it goes, hopefully this removes one more step from onboarding
[22:20:08] ok, I'm going to check out now, enjoy your day/evening/morning/..
[22:20:46] jbond: thank you, good night, I am taking care of the puppet runs
[22:21:06] great, thanks, and good night :)
[22:21:13] /Stage[main]/Security::Access/File[/etc/security/access.conf.d/99-labs_restrict_to_project]/ensure: removed
[22:21:23] /Stage[main]/Profile::Ldap::Client::Labs/Security::Access::Config[labs-restrict-to-group]/File[/etc/security/access.conf.d/99-labs_restrict_to_group]/ensure: defined content as ...
[22:21:48] "labs" just not going away
[22:22:56] mutante: I don't think anyone outside of sre was in that project, so it should be fine, but I'll double check tomorrow. possibly taavi was in there
[22:24:06] denisse: try it now
[22:24:15] jbond: *nod* great! ok
[22:24:23] I ran puppet on the 3 "worker" instances
[22:27:37] Thanks a lot for your help, trying it now.
[22:32:03] I ran puppet on pcc-db1001 but the daemon keeps failing to update the facts.
[22:32:22] * denisse looking at the issue
[22:33:58] maybe make a pastebin