[07:13:21] <_joe_> btullis: I see turnilo is still down, do you have an ETA for when it will be back online?
[08:26:46] _joe_: I'll look into it this morning. Something to do with what razzi was working on, I believe.
[08:27:01] <_joe_> btullis: yes, it's been down since friday
[08:27:10] is there a task for it?
[08:27:14] <_joe_> at least getting an expected ETA would be helpful
[08:27:26] <_joe_> marostegui: it went down as part of a change that was done on friday
[08:27:33] yeah I know
[08:27:37] <_joe_> so it's not like it broke
[08:27:55] <_joe_> meaning the relevant task should be the one for the upgrade
[08:27:55] What I am wondering is if there's a place we can watch/subscribe
[08:28:00] So we don't have to ping people specifically :)
[08:35:13] Yes, I see it was acknowledged in Icinga against the upgrade ticket: https://phabricator.wikimedia.org/T301990 by m.utante
[08:36:51] razzi did some work yesterday upgrading and downgrading various an-tool hosts, but I'll see if there is anything I can do to get turnilo production back up today.
[08:41:36] <_joe_> btullis: thanks a lot, but it's also ok if we get a status update later in the day <3
[08:42:55] _joe_: Understood, thanks. I'll let you know.
[09:12:17] _joe_: marostegui - I think that turnilo is back up and running at version 1.35 now. Are you able to confirm?
[09:12:47] btullis: works for me!
[09:12:55] <_joe_> it is up, but
[09:12:59] it opens but I don't see some of the dashboards
[09:13:04] <_joe_> it seems it's missing dashboards
[09:13:07] <_joe_> yep
[09:13:08] I see the dashboards, but with no data
[09:13:33] or no data in the ones I'm opening
[09:13:36] <_joe_> marostegui: some have data
[09:13:45] <_joe_> see event_navigationtiming has data
[09:13:54] OK, thanks all. I'll carry on investigating.
[09:14:20] yes I was able to get some data on some dashboards, but the usual ones I use are missing
[09:32:19] https://phabricator.wikimedia.org/T304898 do we really need such a big amount of logs? ie: -rw-r----- 1 root adm 4.7G May 17 00:01 puppetmaster.puppet.log.1
[09:32:43] we could purge some of the 256M files (puppetlogs 1 to 30)
[09:34:32] re: turnilo - Could you check against the list of dashboards that I see, please, to verify whether those you expect to see are missing from there as well? https://phabricator.wikimedia.org/T301990#7933800
[09:35:45] btullis: I see the same list as yours, if that's what you're asking
[09:36:10] https://turnilo.wikimedia.org/#wmf_netflow and https://turnilo.wikimedia.org/#webrequest_sampled_128/ are the two I use regularly that are missing
[09:36:14] Thanks, yes, and the ones you use are missing from that list?
[09:36:23] one that comes to mind is wmf_netflow
[09:36:36] and the pageviews ones
[09:36:44] I don't recall the exact names, let me check wikitech
[09:36:56] OK, thanks both. Will continue investigating.
[09:37:33] marostegui: Most of the entries don't look that useful either.
[09:41:33] cc moritzm, jbond for the puppetmaster logs question ^^^
[09:52:14] there are ~8GB in /tmp/tmp.* directories, they all seem to be related to a refresh of the debian installer, I guess they can all be deleted
[09:54:36] let me check
[09:55:02] yeah, I'm cleaning those out
[09:55:37] moritzm: can you take a look also at the logs and delete them if that's ok?
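(As an aside: the manual /tmp cleanup discussed above could also be expressed declaratively with Puppet's built-in `tidy` resource. A minimal sketch, assuming the installer leftovers match `tmp.*` and that a one-week age threshold is safe — both assumptions for illustration, not the actual production setup:)

```puppet
# Sketch: reap stale debian-installer refresh leftovers under /tmp.
# The match pattern and age threshold are illustrative assumptions.
tidy { 'netinst-tmp-leftovers':
  path    => '/tmp',
  matches => 'tmp.*',   # only touch the tmp.* entries discussed above
  age     => '1w',      # anything untouched for a week counts as stale
  recurse => 1,         # look one level down, not into every subdirectory
  rmdirs  => true,      # remove matching directories, not just files
}
```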
[09:58:44] back to 64% just with those tmp gone :D
[09:59:08] yeah, I don't think dropping logs is needed actually, with the cleanup of netinst leftovers I just did we're back to 64% disk usage
[09:59:22] thanks, I will close the task
[09:59:30] and puppetmaster1001 is up for replacement as well, the new hardware has already arrived, it "just" needs the switchover of the servers
[10:00:27] pm1001 currently has 2x200 GB SSDs and the replacement has 2x960 GB
[10:00:38] that's an improvement!
[11:53:53] turnilo update - there is a temporary workaround in place which brings back the dashboards that you mentioned
[11:54:33] However, there is still an ongoing configuration issue because it now doesn't display autodetected cubes. I'm still working on this.
[12:18:01] thanks for the update
[12:34:04] btullis: thanks! I can't see wmf_netflow though
[12:35:20] Sorry. I was troubleshooting again. I have someone from the Turnilo Slack looking into it with me. `wmf_netflow` should be back again now.
[12:36:25] btullis: ok! it's fine if it's not there when you're working on it btw
[15:33:24] jbond, mutante, what happened with https://gerrit.wikimedia.org/r/c/operations/puppet/+/791677 ?
[15:34:48] andrewbogott: nothing has happened as yet, but as per the comments those certs are currently in use
[15:35:03] jbond: ok, but you abandoned the revert?
[15:35:33] no, the revert was of the CR you merged in relation to ldap-labs (not ldap-corp)
[15:36:01] I created the revert as I thought 791677 (ldap-corp) was merged and wanted it reverted, as the certs are in use
[15:36:35] when I realised it was a different change I abandoned the revert
[15:36:44] aaaah ok. So the CR I merged didn't break things, it's just 791677 which should not be merged.
[15:36:59] I also am apparently unable to tell the difference between ldap-labs and ldap-corp when reading
[15:37:07] :)
[15:37:20] I didn't check labs, I assumed you ran puppet there?
[15:37:31] but yes, the corp one shouldn't get merged just yet
[15:37:58] cool. And yeah, the ldap-labs things were renamed recently so I'm 99% sure those certs are now meaningless.
[15:38:05] ack cool
[15:38:13] * andrewbogott checks puppet run state on serpens just to be sure
[15:39:12] yeah, seems happy
[15:39:17] great
[15:39:20] thank you for explaining
[15:39:25] np
[17:10:37] I am basically taking my lunch break early today. Got something to do outside and will be back in ~1.5 hrs.
[17:16:44] !log ganeti4003 rebooting for firmware updates via T307997
[17:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:49] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997
[18:53:49] jbond, around? I have a puzzle re: https://gerrit.wikimedia.org/r/c/operations/puppet/+/714975
[18:58:56] (that puzzle being, why is ::nrpe not included on cloud VMs anymore, and how is it getting included on production hosts? I see it everywhere but absolutely can't figure out where puppet is including it.)
[19:02:52] andrewbogott: profile::base::production (as in the latest version in git) includes class nrpe via profile::monitoring, which is gated behind the hiera key profile::base::production::enable, which is true in prod and false in cloud
[19:03:46] worth noting that in codfw1dev I've had puppet errors about /usr/local/lib/nagios/plugins/ missing; previously I thought it was a codfw1dev-specific thing, but we might just have an old enough base image in eqiad1 that the error is not visible
[19:03:47] oh, there it is, 'class { 'nrpe':'
[19:03:55] I grepped and grepped but didn't find that
[19:04:16] taavi: yes, that's exactly the issue I'm trying to fix. It's not codfw1dev specific but rather specific to using a recently-created base image
[19:04:29] Because that directory used to be included via ::nrpe
[19:05:40] Do you have an opinion about what the right fix is? Include ::nrpe everywhere, or include profile::monitoring everywhere, or... ?
[19:06:29] uhh good question
[19:07:18] aiui we're trying to get rid of icinga/nrpe long term, so the ideal solution imo would be to only provision those scripts on hosts with nrpe installed
[19:08:05] that could be done like ferm::conf works atm (exported file resource with a tag), but would take some effort
[19:08:12] It does seem like we could just have nrpe::check require nrpe and then we'd get it if/when we need it
[19:08:37] Hm... if there's a way to do that in puppet without resulting in duplicate definitions
[19:09:29] or as a very quick hack, declare those directories in profile::wmcs::instance
[19:09:41] I'd like to avoid installing nrpe on hosts that don't use it
[19:09:47] That works as long as we know that ::nrpe will /never/ be included...
[19:10:00] that's why I said "very quick hack"
[19:10:14] But the fact that we're seeing those errors means that VMs are already installing nrpe things everywhere, right? So maybe that's what I should be looking at.
[19:10:53] VMs are trying to install scripts called by NRPE to a directory that's only present on hosts with NRPE present
[19:11:15] Right, but why install those scripts if we don't have nrpe present?
[19:11:57] exactly my point
[19:12:41] right now the 'systemd' class installs an nrpe plugin
[19:13:39] among other things...
[19:16:01] let me draft up what I was talking about
[19:21:23] I think I understand, have a patch in the works...
[19:21:27] if I can ever get 'git review' to complete
[19:21:53] andrewbogott: https://gerrit.wikimedia.org/r/792700
[19:24:40] taavi: won't that cause a resource collision on all prod hosts?
[19:24:57] (My much-less-elegant solution is https://gerrit.wikimedia.org/r/c/operations/puppet/+/792701 )
[19:25:00] why would it?
[19:28:49] I think there's something I'm not understanding in your patch. Where is the conditional piece?
[19:28:59] I'm unfamiliar with the @file syntax, for one thing
[19:30:01] that's a puppet 'virtual resource', practically meaning that it's only applied when something explicitly tells it to be applied
[19:30:14] in this case that 'something' is the nrpe class
[19:30:43] oh, fancy!
[19:30:44] so anything anywhere can declare an nrpe::plugin, which will declare that file resource on hosts with nrpe installed and do nothing on hosts that don't have nrpe installed
[19:31:16] Well, it's certainly more comprehensive than my suggestion... I guess we might be back to waiting on jbond to endorse.
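(To make the virtual-resource pattern taavi describes concrete, here is a minimal sketch. The define name nrpe::plugin matches the discussion, but the parameters, plugin path, and tag name are assumptions for illustration, not the actual contents of the Gerrit change:)

```puppet
# Sketch: a define declares the plugin file *virtually* -- the leading
# '@' adds the resource to the catalog without applying it.
define nrpe::plugin (
  String $source,
  String $mode = '0555',
) {
  @file { "/usr/local/lib/nagios/plugins/${title}":
    ensure => file,
    owner  => 'root',
    group  => 'root',
    mode   => $mode,
    source => $source,
    tag    => 'nrpe::plugin',
  }
}

class nrpe {
  # Only the nrpe class realizes (collects) the tagged virtual files,
  # so hosts that never include nrpe get no plugin files and no
  # missing-directory errors.
  File <| tag == 'nrpe::plugin' |>
}
```

(With this in place, a class like systemd can declare `nrpe::plugin { 'check_systemd_state': source => ... }` unconditionally — plugin name hypothetical — and the file only materialises where the nrpe class is included.)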
[19:31:33] I will write a phab task to provide context
[19:31:39] the problem is that you now need to move everything to the new define for it to take effect
[19:31:50] sounds good!
[19:37:33] T308601
[19:37:33] T308601: Puppet fails on new cloud-vps VMs (with new base images) due to wanting /usr/local/lib/nagios/plugins - https://phabricator.wikimedia.org/T308601
[20:06:10] taavi, andrewbogott: My knowledge may be stale, but I don't think that exported resources & resource collectors work as hoped in most Cloud VPS projects. I think you have to have a project-local puppetmaster & a puppetdb instance to make them actually do things.
[20:07:22] definitely true that exported resources don't work (they break multi-tenancy). But taavi's suggestion is only about effect within a single catalog, I think?
[20:07:44] Although if it relies on the same mechanism it will fail
[20:12:54] *nod* It may be that `@...` and `<| ... |>` work (staying in the same catalog) and only `@@...` and `<<| ... |>>` require puppetdb (which requires local project setup)
[20:14:14] The main cases from the past that I'm thinking of were the latter for sure (ssh host keys and grid engine node registration)
[20:17:03] bd808: 'virtual resources' (one @) work just fine, while 'exported resources' (two @s) indeed need puppetdb
[20:17:31] look at ferm::conf for example, which is already used in the cloud realm
[20:19:07] Thanks for indulging my drive-by half-remembered rambling, taavi and andrewbogott. :)
[20:20:51] BTW I find all the @<| stuff to be syntactically horrifying and ignore it as much as possible
[20:21:52] the sugary syntax for a seldom-used feature is certainly an easy way to cause confusion
[20:22:51] I think o.ri and I made some messes with it in mediawiki-vagrant at one point.
[20:26:42] almost every day I remember that there are graphing tools for puppet and I wonder why we don't use them, but I've managed to postpone actually learning about and setting them up for the last 4000 days.
[20:44:20] ROFL, I thought I was the only one in the remembering/forgetting puppet visualizations timeloop
[20:46:06] andrewbogott: taavi: have read the backlog, will check CRs tomorrow, but the idea seems sound to me
[20:47:38] as to graphing the catalog dependencies, I have not found a tool that makes it look nice and tend to prefer just looking at the raw catalogue. Although I think jhath.away may have mentioned they had some success in this area
[20:51:45] Yeah, I can imagine I might need a 1000" monitor to usefully view the graph
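(As a closing cheat sheet for the sigil soup discussed above, a hedged sketch contrasting the two mechanisms. The sshkey example mirrors the ssh-host-key use case bd808 mentions; resource titles are illustrative. Only the double-sigil forms need PuppetDB, which is why they fail on Cloud VPS projects without a project-local puppetmaster:)

```puppet
# Virtual resource: one '@' declares it, a same-catalog collector applies it.
# Works everywhere, including the cloud realm -- no PuppetDB involved.
@file { '/usr/local/lib/nagios/plugins': ensure => directory }
File <| title == '/usr/local/lib/nagios/plugins' |>   # realize within this catalog

# Exported resource: two '@@'s store it in PuppetDB for *other* nodes to collect.
@@sshkey { $facts['networking']['fqdn']:
  ensure => present,
  type   => 'ssh-ed25519',
  key    => $facts['ssh']['ed25519']['key'],
}
Sshkey <<| |>>   # collect keys exported by every node (requires PuppetDB)
```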