[08:53:17] What is the deal with all those "Storage /var over 50% global noc" alerts?
[08:53:21] Should we track that somewhere?
[09:13:49] <_joe_> marostegui: ?
[09:14:22] _joe_: https://alerts.wikimedia.org/?q=alertname%3DStorage%20%2Fvar%20over%2050%25
[09:14:32] We've got a bunch of emails since yesterday
[09:14:51] <_joe_> the part I didn't understand was "global noc"
[09:14:53] <_joe_> :D
[09:15:08] _joe_: I just copied the email subject
[09:15:10] <_joe_> so topranks or XioNoX might be updating those switches
[09:15:37] _joe_: yeah, there's eqiad maintenance this week and the following ones, but should we silence those, ack?
[09:17:33] _joe_: indeed yes, sorry, this was us
[09:19:28] marostegui: I should have removed the image after the upgrade, the alert should clear now
[09:19:37] topranks: \o/ thanks
[09:20:08] we should maybe review the alerting; the network devices don't usually fill up their drives, or have the potential to like a normal system does
[09:23:15] I've modified it so the threshold is higher and the delay before alerting is longer, to give us some buffer when doing these works
[09:23:20] (in LibreNMS)
[10:13:46] I've got a reimage that seems stuck at '[9/10, retrying in 1280.00s] Attempt to run 'spicerack.puppet.PuppetServer.wait_for_csr' raised: The puppet server has no CSR for moss-be2001.codfw.wmnet', which I don't think I've seen a lot of times before...
[10:45:56] I'm trying again (having successfully reimaged another host), and it's still sticking here.
[10:46:57] https://phabricator.wikimedia.org/P64944 <-- any suggestions on unsticking this?
[10:48:22] AFAICT it's generating a new Puppet cert OK on moss-be2001, generates a CSR, and then somehow that fails to end up on the puppet server?
[10:49:18] Emperor: should that be using puppet 5 or puppet 7?
[10:49:57] 7
[10:50:16] I see the CSR has made it to the puppet 5 server..
[10:51:28] Hm, maybe the data-persistence insetup role hasn't been migrated, but I thought moritzm had done that.
[10:51:46] [if that's the issue, I have a CR to assign that node to a new role, so should probably just merge that]
[11:00:03] Emperor: hmm, the logs look very weird.. spicerack logs show that spicerack thinks that d-i completed and the server booted into the new OS at around 09:17, while /var/log/installer/syslog on the box shows d-i booting at 09:16
[11:00:30] so somehow the host was (re?)-running d-i when spicerack thought it was on the new OS installation, so it didn't manage to set the file telling d-i which Puppet version to install
[11:01:09] Yeah, the host was trying to boot off the wrong drive; I fixed it, but the installer went through another time. Maybe I should just start it from scratch again.
[11:01:34] I think the options at this point are either to re-try the reimage, or to manually push the host to puppet 7 and send a CSR to the puppet 7 servers, and then hope the cookbook continues as usual
[11:01:38] that would do it, I think
[11:02:44] GitLab needs a short maintenance in one hour
[11:02:55] I'll re-try the whole reimage
[11:19:57] the data-persistence insetup role defaults to Puppet 7
[11:20:30] so this seems like some race condition with the cookbook of sorts
[11:21:26] reinstall-from-scratch has got to Generating a new Puppet cert.
[11:22:13] moritzm: if the cookbook now needs to insert a puppet version into the installer somehow, then that won't work if the installer runs more than once "under" the cookbook (e.g. because it didn't work and the operator fixed it); I think that used to work because there wasn't a question of puppet version
[11:22:28] ...and it's now signed OK
[11:29:12] the reimage cookbook only needs to get passed the Puppet version when running with --new (and IIRC this is a mandatory argument); for the case of reimaging a system already known to PuppetDB, the existing Hiera config is queried
[11:49:35] Sigh, now the reimage is stuck at '[33/50, retrying in 99.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title moss-be2001 not found yet'
[11:50:54] has something else gone wrong, or am I just doomed to spend all day repeatedly trying to re-image this node? :(
[11:51:35] it signed the new puppet cert OK, then said "Run Puppet in NOOP mode to populate exported resources in PuppetDB", and now has been sat for quite some time because Nagios doesn't know about the host.
[11:53:55] is it likely to converge in the next 15 tries and/or is there something I can kick to try and unwedge it?
[11:59:50] I tried running puppet on alert1001, no obvious joy
[12:00:02] the first puppet run failed even in noop mode, https://puppetboard.wikimedia.org/report/moss-be2001.codfw.wmnet/cbca7d7447dc9eace82c0a84a061c8ce3361258a
[12:02:16] taavi: I fixed that problem yesterday with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042991
[12:02:40] taavi: and should that stop nagios finding it?
[12:03:11] yes, if the puppet catalog fails to compile, the exported resource that adds the icinga host never gets added
[12:03:46] (fixed> so I'm not sure why it's recurring - puppet runs fine on e.g. moss-be1001 after I applied that fix)
[12:04:05] taavi: Hm, maybe the cookbook should notice that, then.
[12:08:01] taavi: the bit that's failing is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042991/2/modules/profile/manifests/cephadm/controller.pp where the mgmt hostname is constructed and looked up in $profile::netbox::data::mgmt; will no-op mode have interfered with that somehow?
[12:10:29] if so, I could wrap that in an if defined (and have it stick in some suitable dummy value if not)
[12:14:54] Emperor: I would start from trying something like this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043750 (untested)
[12:16:48] taavi: what would you think of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043752 ?
[12:17:10] GitLab maintenance is delayed by 45 minutes
[12:17:42] I'll start at roughly 12:45 UTC
[12:17:55] taavi: (templating out hosts.epp expects something there, so having a bogus-but-extant value is I think more robust)
[12:18:55] Emperor: that'd work too I think; I'd maybe add the warning from my patch to help with troubleshooting why the data is not there
[12:19:46] will do
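A hypothetical sketch of the kind of guard being discussed here, not the contents of either Gerrit change: only $profile::netbox::data::mgmt, the mgmt-FQDN construction, and the "warn but fall back to a bogus-but-extant value" idea come from the conversation; the variable names and exact shape of the data are assumed.

    # Sketch (assumed, not the merged patch): guard the mgmt lookup in
    # profile::cephadm::controller so a host missing from the Netbox-derived
    # data can't break catalog compilation.
    # Assumes $profile::netbox::data::mgmt is a Hash keyed by mgmt FQDN.
    $mgmt_fqdn = "${facts['networking']['hostname']}.mgmt.${facts['networking']['domain']}"

    if $profile::netbox::data::mgmt =~ Hash and $mgmt_fqdn in $profile::netbox::data::mgmt {
        $mgmt_addr = $profile::netbox::data::mgmt[$mgmt_fqdn]
    } else {
        # Warn to help troubleshoot why the data is missing, but keep a
        # bogus-but-extant value so downstream templating still has something to render.
        warning("profile::cephadm::controller: did not find mgmt data for host ${facts['networking']['fqdn']} (${mgmt_fqdn})")
        $mgmt_addr = 'HOST LOOKUP FAILED!'
    }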
[12:30:20] OK, now any attempt to reimage that node is exploding with a cumin failure :(
[12:31:36] https://phabricator.wikimedia.org/P64965 <-- cookbook initialisation is failing
[12:31:56] presumably, that's failing because it wants to try and downtime the host, but can't because icinga doesn't know about it?
[12:35:43] that too is failing because the catalog compilation is still failing, this time with:
[12:35:51] taavi@puppetserver2001 ~ $ sudo puppet lookup --render-as s --compile --node moss-be2001.codfw.wmnet profile::puppet::agent::force_puppet7
[12:35:51] Warning: Undefined variable '::_role';
[12:35:51] (file & line not available)
[12:35:51] Warning: Scope(Class[Profile::Cephadm::Controller]): profile::cephadm::controller: did not find mgmt data for host moss-be2002.codfw.wmnet (moss-be2002.mgmt.codfw.wmnet)
[12:35:51] Error: Could not run: Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /srv/puppet_code/environments/production/modules/cephadm/templates/hosts.epp, line: 14, column: 11)
[12:38:46] endless sadness
[12:39:30] well, I can put conditionals round the bits of templating too
[12:45:17] taavi: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1043762 ?
[12:45:40] [also, does your rune re-run the attempted compilation, or will I need to poke the host separately]
[12:46:27] I don't think this is due to the no-op run.. the warning mentions moss-be2002, which is not in PuppetDB at all and is marked as failed in Netbox
[12:47:15] That node needs re-imaging. I was going to do that once I'd got moss-be2001 reimaged /o\
[12:47:50] but I think the template change is useful anyway?
[12:47:55] anyhow, can ceph handle a node with 'addr: HOST LOOKUP FAILED!'? or should that be omitted entirely?
[12:48:35] taavi: it'll refuse to add that node, but that's OK; the operator has to explicitly feed the hosts file to cephadm (and it won't apply it if unhappy)
[12:49:10] fair enough
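A sketch of the sort of conditional templating referred to above. This is not the real modules/cephadm/templates/hosts.epp, whose actual structure isn't shown in the log; the parameter shape and the host-spec fields are assumptions. The point is only that indexing into a missing entry produces a comment in the output rather than the "Operator '[]' is not applicable to an Undef Value" compile failure.

    <%- | Hash[String, Optional[Hash]] $hosts | -%>
    <%- # Render one cephadm host-spec document per host; skip entries with no data. -%>
    <%- $hosts.each |$name, $data| { -%>
    <%- if $data =~ Hash { -%>
    ---
    service_type: host
    hostname: <%= $name %>
    addr: <%= $data['addr'] %>
    <%- } else { -%>
    # no mgmt data for <%= $name %>; cephadm is not given this host
    <%- } -%>
    <%- } -%>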
[12:51:47] your puppet lookup rune now says "true", so I'll try again
[12:55:16] GitLab maintenance finished, sorry for the delays :)
[13:02:32] btw team, thoughts/comments on https://phabricator.wikimedia.org/T367466 appreciated
[13:06:21] taavi: thanks so much for your help
[15:24:19] for fans of IPv6-only deployments: https://phabricator.wikimedia.org/P64983
[15:40:05] Emperor: wow, I am impressed :)
[15:40:07] kudos
[15:40:30] is this something you are trying to move towards?
[15:40:48] It seemed like a sensible thing to try doing with the new cephadm-based clusters.
[15:42:58] great
[15:45:06] <_joe_> and what for fans of ipv4-only in the datacenter? :D
[15:56:48] _joe_: they get to enjoy the schadenfreude when there turn out to be subtle bugs in the cephadm ipv6 code
[15:57:06] <_joe_> I would NEVER
[15:57:12] :)
[19:47:58] Is there a way to trace back what email address is associated with the "Search Platform" contact as defined in Puppet ( https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/wmflib/types/team.pp )? Trying to figure out why we didn't get some alerts for dead elastic hosts