[00:28:17] Is this... normal?
[00:28:19] https://www.irccloud.com/pastebin/ES62jDoX/
[00:28:48] For arcane reasons I need to do some in-place upgrades and am not sure where puppet is determining the wrong distro :(
[00:31:23] I wouldn't say normal, no. But that's also because in-place upgrades are not really a thing anymore since all the newer automation ...
[00:31:35] trying to find out a bit though
[00:32:31] Weirdly I need to do in-place upgrades for the whole cluster and /then/ I can reimage
[00:32:53] to avoid clobbering files and having them try to re-sync between versions
[00:33:04] (Probably I can work around this by reimaging and then syncing by hand, but I'd prefer not to)
[00:34:06] When puppet thinks your host is buster but your host is actually bullseye it tries to install a LOT of python2 packages that aren't available, and the wall of red is astounding
[00:35:49] andrewbogott: so there is /etc/puppetlabs/facter/facter.conf and it has some "ttls"
[00:35:59] and one of them is "operating system": 1 days
[00:36:07] that would be my guess
[00:36:22] sounds like it might take a day to expire?
[00:36:33] hm, I wonder if I implemented that :/
[00:36:35] Will try!
[00:36:39] But first I need to renew some downtimes
[00:37:46] andrewbogott: try facter --no-cache
[00:38:41] yeah, that's it...
[00:38:45] :)
[00:39:00] But I need to force an expire, I guess I can try changing that ttl
[00:39:19] so I guess then.. somehow introduce a one day waiting period
[00:39:29] or delete that line from the config
[00:39:58] https://puppet.com/blog/facter-part-3-caching-and-ttl/
[00:40:29] "One of the benefits of this simple implementation is the custom facts can be forced to refresh by purging all files in /etc/puppet/facts.d directory"
[00:40:45] it's not a custom fact though
[00:41:43] hm, I removed those ttls and now something else is broken :(
[00:41:49] andrewbogott: bad news https://tickets.puppetlabs.com/browse/FACT-1544 :p
[00:41:58] "leaves behind cached facts if you enable fact TTLs in facter.conf and then remove those TTLs"
[00:42:48] dang
[00:42:53] they closed it as fixed though.. I am a bit confused
[00:43:00] if that's true then I wonder why my catalog doesn't compile now
[00:43:07] 2017 is a while ago too
[00:44:09] catalog is messed up because it is somehow mixing the 2 versions? :/
[00:46:54] I think the distro check is working now, I'm trying to merge a fix for the other problem
[00:48:22] ok
[00:52:57] Thank you for finding that cache!
[00:53:29] glad I could help :)
[08:35:59] FYI, I need to disable Puppet in eqiad for ~5 mins
[08:41:13] and back on
[12:29:41] not sure if anyone has any advice for an issue I'm seeing.
[12:29:48] may need to wait until volan.s is back Monday
[12:29:51] hi all, I have a meeting scheduled for 16:30 UTC today to go over some puppet CI stuff. The meeting was requested by search so there may be some specific search/puppet questions. However I also drafted a quick slide deck to give a general overview with the aim of making this a future SRE intro talk. As such I think it could be generally useful for any other new starters, but also if there are any old
[12:29:57] timers that want to come along that may also be ...
[12:30:00] ... helpful in keeping me honest and help improve the talk for future SRE sessions.
[12:30:25] ping me if interested and I'll add you
[12:30:35] topranks: what's the issue
[12:30:36] super John, yes please send on the invite.
[12:30:48] * btullis ping jbond
[12:32:17] * jbond sent
[12:32:52] See you there. Thanks.
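[Editor's note: the fact-cache behaviour discussed earlier this morning (around 00:35) can be inspected and cleared from the shell. This is a rough sketch assuming a default puppet-agent layout, i.e. facter.conf under /etc/puppetlabs/facter/ and the JSON fact cache under /opt/puppetlabs/facter/cache/cached_facts/; verify both paths on your hosts before deleting anything.]

```sh
# Show the TTL block that keeps the "operating system" fact group cached.
cat /etc/puppetlabs/facter/facter.conf
#   facts : {
#     ttls : [
#       { "operating system" : 1 days },
#     ]
#   }

# One-off check of what Facter reports when the cache is bypassed.
facter --no-cache os

# Force a refresh by removing the cached fact-group files (assumed default
# cache directory for Facter 3/4), then re-run the agent in noop mode to
# confirm the catalog compiles against the right distro.
rm -f /opt/puppetlabs/facter/cache/cached_facts/*
puppet agent --test --noop
```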
[12:32:53] I'm trying to reimage a server, but PXE boot is failing.
[12:33:19] The system is trying the first port on the first NIC, but the switch has been connected up to the second port on the NIC.
[12:33:35] So I'm unsure if there is a way around this with the reimage cookbook / pushing drac settings.
[12:33:51] Or if it's simply that DC-Ops should always use the first port and I'd be best just getting them to move it
[12:35:00] topranks: ack, let me check the code, this is something vola.ns has been working on recently
[12:35:50] ok, please go to no trouble.
[12:36:10] and to clarify, what I'm trying to test here is our "normal install process" (in the new eqiad rows)
[12:36:51] so definitely no call for some bespoke workaround
[12:39:36] topranks: it looks like the logic for picking different NICs is only in the provision cookbook
[12:39:39] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/757410/1/cookbooks/sre/hosts/provision.py#231
[12:39:50] also that fix has not been deployed yet
[12:40:38] ok thanks a mil
[12:41:01] however I think that may be just about picking between the embedded NIC1 and the PCI 10G NICs
[12:41:07] if it was merged I could undo and reprovision, but as it's not I'll just ask DC-Ops to move the cable when they're online.
[12:41:27] as such I think it's worth pinging dc-ops to see if it should be in NIC1
[12:41:36] yes, myself and Riccardo were discussing that (embedded vs PCI NICs), so it doesn't surprise me if that is the focus of the patch
[12:41:44] yeah I'll do that
[12:41:49] topranks: can you let me know what dc-ops says, i.e. if this should work or if it should have been in NIC1
[12:42:01] John C is going to be there today anyway
[12:42:08] ack sg
[12:42:20] jbond: yep, no problem, I'll pass on any feedback
[12:42:26] cheers
[12:48:04] topranks: yes, I was told we always connect to the first port of the NIC of choice
[12:48:15] so that's what the provision cookbook does
[12:48:54] that said, if you need to change it you can do it via iDRAC, either via racadm or via the web interface (with an ssh tunnel)
[12:49:31] but be aware that this use case is not covered, so if there is a reason why it's that way we need to adjust the automation accordingly
[13:19:22] Thanks for confirming volan.a
[13:19:58] I'll get them to move the cables, seems simplest
[13:20:23] Cheers
[17:16:46] What is the best way to answer the question "Do we still have Jessie boxes that we manage with puppet?", perhaps puppetboard?
[17:18:18] jhathaway: https://puppetboard.wikimedia.org/fact/operatingsystemmajrelease
[17:18:30] boom, thanks cdanis!
[17:18:49] that's just production puppet of course
[17:19:17] oh and there's also https://w.wiki/4rJR
[17:19:21] (be already logged into grafana)
[17:19:59] oooh, nice, that is another good option
[17:20:51] and to see which hosts are running a given version you can do `node_debian_version{version="9.13"}` for example
[17:21:25] thanks
[17:24:55] I was trying to get that info from https://puppetboard.wikimedia.org/query too, but it doesn't seem to like any query I put in. Am I doing something wrong?
[17:30:05] not sure, I couldn't figure out the query language either
[17:31:48] the query language is horrifying. I'm also not sure that the web frontend for it works.
[17:32:33] Cool. Just wanted to be sure I wasn't missing out on something :-)
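[Editor's note: a minimal sketch of the "curl directly on a puppetdb host" approach that comes up just below. PuppetDB exposes the same AST query language as the puppetboard query page over its v4 HTTP API, so a per-release count of hosts can be pulled from the facts endpoint. The host, port and lack of auth are assumptions (the example targets the plain-HTTP localhost listener on 8080; a TLS setup would use 8081 with client certificates instead).]

```sh
# Count hosts per Debian codename via the PuppetDB v4 facts endpoint.
curl -s -G 'http://localhost:8080/pdb/query/v4/facts' \
  --data-urlencode 'query=["extract",
                            [["function", "count"], "value"],
                            ["=", "name", "lsbdistcodename"],
                            ["group_by", "value"]]' | jq .

# Or list the hosts still on a given release (e.g. jessie).
curl -s -G 'http://localhost:8080/pdb/query/v4/facts' \
  --data-urlencode 'query=["and", ["=", "name", "lsbdistcodename"],
                                  ["=", "value", "jessie"]]' \
  | jq -r '.[].certname'
```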
[17:32:48] I have made it work with curl directly on a puppetdb host, one sec
[17:33:46] ah
[17:33:50] well, https://phabricator.wikimedia.org/P8744 is similar but different
[17:34:01] (dumps entire catalogs across the fleet and then searches them with `jq`)
[17:35:29] there are some examples, possibly some of them working, in puppetdb1002:/home/cdanis/.zsh_history
[17:36:19] Nice one, thanks.
[17:39:25] My attempted server re-image appears to be failing, it mostly went ok but there were some quirks during the process
[17:39:51] I'll pick it up with volan.s next week and get his input
[17:39:57] Right now it's stuck at this stage:
[17:39:58] [37/50, retrying in 111.00s] Attempt to run 'cookbooks.sre.hosts.reimage.ReimageRunner._populate_puppetdb..poll_puppetdb' raised: Nagios_host resource with title elastic1093 not found yet
[17:40:13] Not sure if anyone is familiar, but if I cancel this is it ok to leave it that way?
[17:40:27] Or should I take any other action to prevent any alerts etc. about this host over the weekend?
[18:04:25] topranks: I would say to downtime the host in icinga, except presumably the host is not visible in icinga :)
[18:04:49] andrewbogott: yeah thanks, that was kind of where my thinking was at :)
[18:07:29] the reimage has now failed so the state shouldn't change, and it doesn't seem to have managed to add itself to Icinga or trigger any alerts.
[18:07:34] so I think it'll be ok
[18:29:59] I'm about to do some reimages so I hope that this was somehow unique to your server
[19:09:18] jhathaway, btullis: you could use this "very friendly"™ syntax on https://puppetboard.wikimedia.org/query
[19:09:21] ["extract", [["function","count"], "value"], ["=","name","lsbdistcodename"], ["group_by", "value"]]
[19:09:35] gives you a 3-row table with debian version and count
[19:10:53] another quick option is to just run $ sudo cumin 'F:lsbdistcodename = bullseye' depending what you need
[19:12:50] jhathaway: I would have used sudo cumin 'foo*' 'lsb_release -c'
[19:13:01] it gives you groups based on output
[19:13:53] topranks: are you actively fixing it? I started to look but things have changed
[19:13:57] while I was looking
[19:15:06] ah topranks, the host clock is way off: Fri 18 Feb 2022 01:13:37 PM UTC
[19:15:21] jhathaway: yet another one is to sort by kernel version on https://puppetboard.wikimedia.org/inventory
[19:15:25] and the puppet NOOP failed because the certificate was not yet valid
[19:16:48] so the catalog was never compiled, hence no exported resource, and hence it can't find the Nagios_host on the icinga host for it
[19:17:01] (on puppetdb directly, but it's the same)
[19:18:00] topranks: you can retry the reimage without the --new and with the --no-pxe option, *but* I strongly advise fixing the clock first as it could have all sorts of side effects.
[19:19:44] thanks volans|off for the other options!
[19:19:58] ah, and to be clear, you have to select the 'facts' endpoint from the dropdown
[19:20:10] in the puppetboard query page for my first option
[20:16:57] volans: thanks, most of that makes sense. There were some other niggles during the install that didn't work right; I'm gone now, will catch up with you Monday on it, cheers!
[20:59:00] ack, ping me on monday
[21:21:38] "second community meeting for scaling the Wikidata Query Service (WDQS) will take place on February 21st" this is nice to see on a WMF mail, and that it is a thing
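[Editor's note: on the clock-skew failure diagnosed at 19:15 above: when the host clock is behind the CA, the freshly signed agent certificate is "not yet valid", so no catalog compiles, no exported Nagios_host resource reaches PuppetDB, and the reimage polling loop times out. A rough sketch for checking both before retrying with --no-pxe; the ssldir path and the use of systemd-timesyncd are assumptions, and a chrony or ntpd host would use its own tooling.]

```sh
# Compare the host clock against a known-good reference.
timedatectl status
date -u

# Inspect the validity window of the agent certificate; a "notBefore"
# date in the future is the "certificate not yet valid" symptom.
ssldir=$(puppet config print ssldir)   # usually /etc/puppetlabs/puppet/ssl
openssl x509 -noout -dates -in "${ssldir}/certs/$(hostname -f).pem"

# If the clock is off, resync it (assuming systemd-timesyncd is in use)
# before retrying the reimage cookbook without --new and with --no-pxe.
systemctl restart systemd-timesyncd
```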