[06:26:40] hello folks, moving kafka-logging1002 to PKI
[06:42:02] all good, 1002 migrated. Only 1003 left :)
[07:47:47] \o/
[07:48:35] going to do 1003 as well :)
[07:54:34] done! metrics are recovering, will watch for a bit
[07:55:41] ok, now kafka logging is working on PKI-only TLS certs
[07:56:05] some cleanup work is needed (like https://gerrit.wikimedia.org/r/c/operations/puppet/+/838650) but it can be done in a few days, just to be sure
[07:56:17] jbond: --^ \o/
[08:34:34] elukey: awesome :)
[13:14:18] jbond: hello! if you have some time today, wanted to discuss an unexpected behaviour we are seeing on the first Puppet run for a new host (dns4003)
[13:14:40] we were setting up dns4003 yesterday and ran into some weird dependency issues. essentially, it seems like the Puppet run is failing with some hard dependencies ... see https://puppetboard.wikimedia.org/report/dns4003.wikimedia.org/1ea8435115b4084fd2d1dc06eb253b49c9bbc99d
[13:14:56] putting aside the ordering issues with some of the other modules (anycast, pdns-rec, etc.), shouldn't the base system be set up? that's also not happening ("Skipping because of failed dependencies")
[13:15:06] any ideas on what's happening here? thanks!
[13:15:57] commit f67b66221 for example points to a race condition related to dependencies, but we have not been able to pinpoint the issue
[13:26:28] sukhe: i have a meeting in 5 mins but will ping after
[13:26:34] np and thanks!
[14:10:53] <_joe_> bblack, vgutierrez, volans, jhathaway: I am in the process of removing php 7.2 from the appservers. Nothing should happen, but just so you know
[14:11:02] thanks!
[14:11:07] 🍿
[14:11:08] thanks
[14:11:12] thanks for the heads-up
[14:38:20] sukhe: ok, so as to the dependencies: the reason we add the dependencies in f67b66221 is to 1) (Class['adduser'] -> Package<| |>) ensure our login.defs and sysuser configuration is in place before any tools are installed. this ensures that any tools that use adduser or sysuser to create users get the correct dynamic uid/gid. and ...
[14:38:34] 2) (Class['apt'] -> Package<| title != 'gnupg' |>) to make sure our apt sources are configured before we install anything
[14:39:04] i took a look at the dot file posted above and afaik the following should loosen the dependency graph enough to make things work: https://gerrit.wikimedia.org/r/c/operations/puppet/+/838804
[14:39:52] specifically dropping the class {'bird': require => Class['::bird::anycast_healthchecker'],} dependency
[15:03:08] bblack: i did look at it when i introduced the adduser dependency. however it caused some other issues; i can't remember exactly off the top of my head what it was, but i think it was related to cross-dependencies between apt, systemd and $stuff
[15:03:26] it probably is worth trying to resolve those issues though
[15:03:51] jbond: I have always wished for it, and maybe there is, but is there a way to better simplify these dependencies in Puppet?
[15:04:11] as in, I don't know, a visual depiction or something that actually lays these out for you vs going through them manually?
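For reference, a minimal Puppet sketch of the two global chains jbond describes above, plus the require dropped in 838804 (illustrative only, not a verbatim copy of f67b66221 or of the bird module):

```puppet
# 1) login.defs / sysuser configuration must exist before any package is
#    installed, so tools calling adduser/sysuser get the correct dynamic uid/gid.
Class['adduser'] -> Package <| |>

# 2) apt sources must be configured before anything is installed
#    (gnupg is exempted from the chain).
Class['apt'] -> Package <| title != 'gnupg' |>

# 838804 loosens the graph on the dns hosts by dropping this explicit require:
class { 'bird':
  # require => Class['::bird::anycast_healthchecker'],  # removed
}
```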
[15:04:34] (perhaps a conversation for later)
[15:04:52] sukhe: not that i know of; it's inherently a pain. jhathaway may know of something
[15:04:58] yeah :(
[15:05:42] there is a python tool I have used for debugging these issues in the past
[15:06:14] sukhe: that graph I posted, you can make such graphs for any run of any role
[15:06:18] there's also a more low-tech fix, which would reliably fix this: we create a wmf-base deb whose main purpose is to pull in all the packages we need before puppet runs the first time (like adduser) and then we install it via late_command.sh
[15:06:30] but for the whole functional tree, not just the failing part
[15:06:45] it's kind of intense though
[15:07:43] moritzm: I like the simplicity of that solution! :)
[15:08:08] fyi 838804 updated
[15:08:31] moritzm: or for that matter, if we don't want to maintain a separate package and the late_command hack: we could even do a lighter-weight use of puppet's staging stuff
[15:08:54] if late_command.sh is too hacky we can do that via the reimage cookbook too :D
[15:09:00] define a very small pre-stage that just handles the base apt + packages stuff and nothing else.
[15:09:31] jbond: looking
[15:09:53] and the cookbook wouldn't even require a meta-package, could have the raw list of base packages in it :D
[15:10:00] please keep in mind wmcs instances don't use d-i or the reimage cookbook
[15:10:43] I am going to disable Puppet on P:Bird::Anycast
[15:11:49] done
[15:11:55] taavi: for those I would expect to add them to the base WMCS images
[15:11:59] jbond: please feel free to proceed :)
[15:12:15] sukhe: ack merging now
[15:15:52] jbond: let me know when ready and we can first test it on the broken dns instance, where it actually matters I think
[15:16:22] I think for the other hosts, it should be a NOOP anyway
[15:16:35] the host is dns4003
[15:16:43] sukhe: ahh sorry, running on dns4003 now. ill paste the full log when complete
[15:16:50] thanks!
[15:16:54] or the link to puppetboard :D
[15:16:57] yes it should be noop everywhere else
[15:18:44] looks like success?
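A minimal sketch of the lighter-weight "small pre-stage" idea floated above, assuming Puppet run stages are used; the bootstrap class name is hypothetical, not an actual profile:

```puppet
# Sketch only: a tiny early run stage that applies apt configuration and a
# handful of bootstrap packages before everything in the default 'main' stage.
stage { 'setup':
  before => Stage['main'],
}

class { 'apt':
  stage => 'setup',
}

# Hypothetical class holding the base packages (adduser etc.) that must be
# present before the rest of the catalog installs anything.
class { 'base::bootstrap_packages':
  stage => 'setup',
}
```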
[15:18:53] ish
[15:19:05] it can't install bird.conf until /etc/bird/anycast-prefixes.conf exists
[15:19:06] yeah, some failures but definitely an improvement
[15:19:19] if it gets there in multiple runs, that's an improvement for sure
[15:19:31] yes it should converge
[15:19:33] one run is ideal, but I'll take multiple runs that work, vs getting perma-stuck
[15:20:40] im not sure how frequently /etc/bird/anycast-prefixes.conf is written, but it won't converge until that's there; it might be worth creating an empty file for those to allow puppet to complete
[15:20:45] Oct 5 15:20:01 dns4003 puppet-agent[15177]: (/Stage[main]/Bird/File[/etc/bird/bird.conf]/content) change from '{md5}cd5fe4f7a2cfec4850aae926c1724c7f' to '{md5}1392a9b1928a6ef93abd343b0bf0dfde' failed: Execution of '/usr/sbin/bird -p -c /etc/bird/bird.conf20221005-15177-6h7kgh' returned 1: bird: /etc/bird/bird.conf20221005-15177-6h7kgh:1:10 Unable to open included file /etc/bird/anycast-prefixes.conf: No such file or directory (corrective)
[15:21:02] ^ that's probably part of or related to that dep on anycast-healthchecker
[15:21:30] yes, but currently puppet doesn't create that file; afaik it's written by the service $sometime after it starts
[15:21:31] jbond: anycast-hc generates that file and the prefixes change when the service is down
[15:21:59] right, but if you have a dep on the service starting and it creates it on start
[15:22:01] so the file changes when the check_cmd fails (which means the service is down and we should stop advertising the prefixes)
[15:22:02] I don't have any numbers, but I'd be surprised if more than one third of our roles get fully applied with a single, initial Puppet run...
[15:22:31] bblack: im not sure it's created on start (it's still not created)
[15:23:26] hmmm ok
[15:23:48] it's not running though, right?
[15:24:43] if you mean anycast-hc, no
[15:26:35] no it's not. anycast-healthchecker.service won't start because gdnsd is not installed, and gdnsd won't install because of https://phabricator.wikimedia.org/P35363
[15:26:58] error is dns4003.wikimedia.org port 443: Connection refused
[15:28:02] which does make sense given that dns4003 is not set up yet, but I am not sure why it's trying to clone itself from itself here vs another host :)
[15:28:18] yes im not too sure on that either
[15:28:35] you should be able to login and look now
[15:28:47] yeah let's take some time and unpack that, it may be a config thing unrelated to the deps issue
[15:28:48] yes thanks, that works
[15:29:23] yes indeed, this feels a bit different to the previous dependency issue. ill step away but please ping if i can help
[15:29:35] thanks for the help in resolving this jbond!
[15:29:38] we will take it from here
[15:29:43] ack no probs
[15:30:12] sukhe: fyi we should be able to merge that doh/acl/requestctl change tomorrow, ping when you want to move forward with it
[15:30:21] sure!
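A sketch of the "empty file" workaround jbond suggests above, assuming a plain File resource is acceptable here; the replace => false guard is an assumption so that anycast-healthchecker stays the owner of the contents once it starts writing them:

```puppet
# Sketch only: ship an empty anycast-prefixes.conf so that the bird.conf
# validation can resolve its include on the very first run, before
# anycast-healthchecker has ever written the file.
file { '/etc/bird/anycast-prefixes.conf':
  ensure  => file,
  content => '',
  replace => false,   # never clobber what anycast-healthchecker writes later
  before  => File['/etc/bird/bird.conf'],
}
```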
[15:31:24] I think that fetch is supposed to be from https://netbox-exports.wikimedia.org/
[15:31:54] hieradata/common/profile/dns/auth/update.yaml:profile::dns::auth::update::netbox_exports_domain: "%{alias('profile::netbox::automation::git_hostname')}"
[15:32:14] hieradata/role/common/netbox/frontend.yaml:profile::netbox::automation::git_hostname: netbox-exports.wikimedia.org
[15:32:20] hieradata/common/profile/netbox/automation.yaml:profile::netbox::automation::git_hostname: "%{facts.networking.fqdn}"
[15:32:23] hieradata/role/common/netbox/frontend.yaml:profile::netbox::automation::git_hostname: netbox-exports.wikimedia.org
[15:32:37] it seems like it's using the former, which isn't meant to happen for this role?
[15:32:41] or it's some defaulting issue
[15:33:00] Stdlib::Unixpath $netbox_dns_snippets_dir = lookup('profile::dns::auth::update::netbox_dns_snippets_dir'),
[15:33:09] doesn't seem to be defaulting here at least
[15:33:41] indeed, it ultimately gets the following
[15:33:42] profile::netbox::automation::git_hostname: "%{facts.networking.fqdn}"
[15:34:04] i would expect this to be an issue on current machines, but perhaps it's only an issue when doing the original clone
[15:34:08] :]
[15:34:32] sukhe: maybe try on another dns box in another dc to repro?
[15:34:43] (whether we've broken this variable for existing cases too)
[15:36:04] volans: sorry to invoke you, but maybe you understand this better ^
[15:37:02] bblack: no worries, catching up with the last few min of backlog
[15:37:06] bblack: sukhe: long term fix, i would just set git_hostname explicitly, e.g. https://gerrit.wikimedia.org/r/c/operations/puppet/+/838844
[15:37:36] volans: just the netbox-exports hostname hieradata stuff
[15:37:50] yep
[15:37:55] there's no 443 listener on these machines, so it definitely wasn't doing this elsewhere
[15:38:14] but yes, it may be that this particular invocation of git clone only happens on a fresh reimage, and it's been broken a while
[15:38:31] you could set profile::netbox::automation::git_hostname, but for the dns boxes that just looks confusing as they don't actually include that profile
[15:38:36] fatal: unable to access 'https://dns4003.wikimedia.org/dns.git/':
[15:38:47] why it's trying to use localhost, that's weird
[15:38:56] volans: because:
[15:38:58] hieradata/common/profile/netbox/automation.yaml:profile::netbox::automation::git_hostname: "%{facts.networking.fqdn}"
[15:39:03] it defaults to the host fqdn
[15:39:10] combined with:
[15:39:13] hieradata/common/profile/dns/auth/update.yaml:profile::dns::auth::update::netbox_exports_domain: "%{alias('profile::netbox::automation::git_hostname')}"
[15:40:05] * volans wonders if we broke that and didn't realize
[15:40:07] which
[15:40:22] hang on
[15:41:24] https://gerrit.wikimedia.org/r/c/operations/puppet/+/764330/25/hieradata/common/profile/netbox/automation.yaml
[15:41:46] ^ this changed it from netbox-exports to the fqdn thing, which probably only broke our fresh dnsbox reimages, but not other running things
[15:41:59] the other boxes look fine fwiw
[15:42:02] that's totally possible
[15:42:06] I was looking for the same
[15:43:27] so that changes only the origin
[15:43:28] was that change necessary? can we just change it back?
[15:43:33] for the git::clone
[15:43:39] for the authdns hosts
[15:43:49] so I guess that doesn't break the existing hosts
[15:43:53] because puppet doesn't change it
[15:43:55] I'm checking
[15:44:30] yep
[15:44:31] [remote "origin"] url = https://netbox-exports.wikimedia.org/dns.git
[15:44:45] the deeper issue here is that validating that refactors "work" for existing roles (hopefully, if we've even checked all existing roles: in this case the Host: test regex didn't include the affected dns boxes anyway)
[15:44:57] isn't sufficient to prove that fresh reimages also still work after a refactor, which is also important
[15:45:12] bblack: volans: i think we can either set the default of profile::netbox::automation::git_hostname back to netbox-exports.wikimedia.org or https://gerrit.wikimedia.org/r/c/operations/puppet/+/838844/1/hieradata/common/profile/dns/auth/update.yaml
[15:45:13] indeed, that's always the case
[15:45:24] the only solution is to periodically reimage a host for each role...
[15:45:41] I think we can hardcode the authdns one to netbox-exports.wikimedia.org
[15:45:47] so as to not affect netbox stuff
[15:45:49] and split them
[15:45:58] the aliases are useful; in this case it was harmful
[15:46:10] i think the latter is better. one of the reasons this could have been missed is that dns boxes don't directly use profile::netbox::automation, so one doesn't necessarily think to check them when changing profile/netbox/automation.yaml
[15:46:11] we can explicitly set profile::dns::auth::update::netbox_exports_domain, but I am not sure why it was done that way in the original case, so not sure if that breaks something
[15:46:27] right
[15:46:30] sukhe: no it doesn't
[15:46:39] let's set netbox_exports_domain hardcoded to netbox-exports.wikimedia.org
[15:46:44] and there's the design question: should we be specifying what the netbox-exports hostname is more than once in general?
[15:46:46] no alias magic
[15:46:48] ok
[15:47:06] bblack: what do you mean by "more than once in general"?
[15:47:09] we get tied up in our own rules here with the alias hack I think (about separation between profiles, etc.)
[15:47:51] volans: that was in response to the idea that we add a new copy of that hostname in dns::auth-specific hieradata
[15:48:53] jbond: I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/838844
[15:48:58] +1ed already
[15:48:59] sukhe: ack sgtm
[15:49:15] should we use https://netbox-exports.discovery.wmnet instead?
[15:49:22] same for me
[15:50:09] sukhe: +1
[15:50:14] thanks, testing
[15:52:11] this problematic reimage has touched on, like, every major puppet pain point topic :)
[15:52:25] Notice: /Stage[main]/Profile::Dns::Auth::Update/Git::Clone[/srv/git/netbox_dns_snippets]/Exec[git_clone_/srv/git/netbox_dns_snippets]/returns: executed successfully (corrective)
[15:52:29] Info: Git::Clone[/srv/git/netbox_dns_snippets]: Scheduling refresh of Exec[authdns-local-update]
[15:52:32] ahahah
[15:52:32] progress
[15:52:41] that's a code smell I'd say :D
[15:52:49] but we need the Puppet equivalent of "have you tried turning it on and off again": "have you tried running the agent again"
[15:53:14] another source of tension is the inverse relationship between the standard Debian packaging model for daemons and what our (traffic, at least) classes typically want to do.
[15:53:51] the debian model is that if a package is for a daemon, the package should be able to start the daemon in postinst by default.
[15:54:28] and we're constantly contorting our puppetization to push in the other direction: we want the config and a bunch of related parts all working before the daemon ever starts, and we'd really prefer that the package not try to start a daemon at all.
[15:55:02] and you can't just generically skip postinst, because there are usually other important things that happen in postinst (like creating directories or setting perms, etc.)
[15:55:44] if there was just some ability to specific, on a per-package basis when installing/updating, --no-start-daemon
[15:56:05] s/specific/specify/
[15:56:42] or maybe even --no-enable-daemon
[15:57:28] I have seen services being explicitly masked in some places before package install for this reason
[15:57:42] yeah
[15:57:58] we have a few different ways of handling it in different places
[15:58:00] the reimage cookbook too has an option for that
[15:58:06] i.e. masking units
[15:58:57] in the authdns case, we really care, because if a stock unconfigured dns daemon starts answering requests on port 53, it could be the case that something is actually routing (even public!) requests to it, and it will break lookups and wreak havoc
[15:59:22] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/hosts/reimage.py#63
[15:59:32] so we jump through all these hoops to make sure that all the data+config is in place before it can start
[15:59:49] * volans vanishing in a meeting, sorry
[16:00:14] volans: yeah, but that seems like the wrong layer. installs+puppetization should still work right even if it's not the reimage script driving them
[16:00:16] the "put down config -> mask -> install package -> unmask" pattern works pretty well for that
[16:00:55] we have a system of doing that in puppet that works now for that use-case, it just adds a lot of complexity/dependency
[16:01:25] it does yeah
[16:01:35] (even the mask/unmask pattern)
[16:02:07] also it constrains the puppet graph, even if just a bit
[16:02:14] sukhe: it's possible the wrong bird version is installed?
[16:03:16] bblack: I think what's happening here is
[16:03:25] that it's not creating our bird.conf for some reason
[16:03:32] and validate_cmd fails with a CF_SYM_UNDEFINED error
[16:03:36] right
[16:03:47] so I am guessing this might be coming from the service[bird] change from class[bird]
[16:03:53] where I am guessing the class would have set everything up
[16:04:30] yeah but it should still execute all that stuff, even if it's async over multiple runs...
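Going back to the "put down config -> mask -> install package -> unmask" pattern mentioned a few messages above, a minimal Puppet sketch might look like the following; gdnsd is used purely as an example, and all resource titles, paths and guards are assumptions rather than the actual puppetization:

```puppet
# Sketch only: land the config first, mask the unit so the package's postinst
# cannot start an unconfigured daemon on port 53, then unmask and start once
# everything is in place.
file { '/etc/gdnsd/config':
  ensure => file,
  source => 'puppet:///modules/gdnsd/config',   # hypothetical source
}

exec { 'mask gdnsd before install':
  command => '/usr/bin/systemctl mask gdnsd.service',
  unless  => '/usr/bin/dpkg -s gdnsd',          # only before the package exists
  before  => Package['gdnsd'],
}

package { 'gdnsd':
  ensure  => installed,
  require => File['/etc/gdnsd/config'],
}

exec { 'unmask gdnsd after install':
  command => '/usr/bin/systemctl unmask gdnsd.service',
  onlyif  => '/usr/bin/test -L /etc/systemd/system/gdnsd.service',  # still masked
  require => Package['gdnsd'],
  notify  => Service['gdnsd'],
}

service { 'gdnsd':
  ensure => running,
  enable => true,
}
```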
[16:04:47] correct
[16:04:54] not sure why it's not creating the conf file at all now
[16:05:29] I have a theory
[16:05:45] the failure points to
[16:05:48] /etc/puppet/modules/bird/manifests/init.pp:76
[16:06:17] and validate_cmd points to validate_cmd => '/usr/sbin/bird -p -c %',
[16:06:26] but there isn't anything there to validate now
[16:06:44] and that fails, and because validate_cmd fails, that's Puppet failing actually and not the service itself
[16:07:05] ● bird.service - BIRD Internet Routing Daemon
[16:07:05] Loaded: loaded (/lib/systemd/system/bird.service; enabled; vendor preset: enabled)
[16:07:08] Active: active (running) since Wed 2022-10-05 15:17:39 UTC; 49min ago
[16:07:15] Oct 05 15:17:39 dns4003 bird[12617]: Started
[16:07:18] yeah, but it's running with a presumably bad config
[16:07:44] and additionally
[16:07:48] well, missing config anyway
[16:08:08] no anycast-prefixes.conf either, but that's expected because the service is not running
[16:08:49] I wish it would save the template output that failed validate_cmd :P
[16:09:28] I am going to start by removing validate_cmd and trying to get a clean run
[16:09:31] we will put it back later
[16:09:46] at least then you'll see what the file looks like
[16:09:51] lol
[16:18:28] finally!
[16:18:29] a clean run
[16:18:30] https://puppetboard.wikimedia.org/report/dns4003.wikimedia.org/beb9b88151517c3448e67077bee7d0071db9e298
[16:18:34] ok that fixed it
[16:18:47] probably some weird race condition between all the failures and whatnot
[16:19:03] I don't want to take validate_cmd out for good though, so I will put it back
[16:19:33] sukhe: in a meeting but fyi i did `touch /etc/bird/anycast-prefixes.conf` to see if that was enough for bird -c to work
[16:19:49] but when i did that we started to get the CF_SYM_UNDEFINED error instead of the failed-to-include one
[16:19:52] ah, hm
[16:20:11] yeah, I think that error comes from the validation because it was reading the Debian default file instead of our config
[16:20:22] ack
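For context, roughly what the failing resource around modules/bird/manifests/init.pp:76 looks like based on the log above; only the validate_cmd value is quoted from the chat, the template name and other parameters are assumptions:

```puppet
file { '/etc/bird/bird.conf':
  ensure       => present,
  content      => template('bird/bird_anycast.conf.erb'),   # assumed template name
  # Puppet writes the candidate content to a temp file and substitutes its
  # path for '%'. When the rendered config includes a file that does not yet
  # exist (anycast-prefixes.conf), bird exits non-zero, the validation fails,
  # and the old file (here the Debian default) stays in place, which is why
  # the run keeps failing without ever showing the new content.
  validate_cmd => '/usr/sbin/bird -p -c %',
}
```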