[00:57:30] <JJMC89> Rook: for T379746, if you need specifics to secure PAWS, I can provide more info privately.
[00:57:30] <stashbot> T379746: Cleanup miners - https://phabricator.wikimedia.org/T379746
[00:58:44] <Rook> No no need for specifics. I thought there was some way to block an IP is what I was wondering about IPs. We can add the security tag to that ticket if you think it is appropriate
[01:01:24] <JJMC89> I can block IPs, which I have done for the proxies/web hosts. The largest group appears to be normal telecom provider though with other legitimate users.
[01:03:14] <JJMC89> Reviewing at this rate isn't sustainable though. Something is likely going to need to be done from the PAWS/cloud side.
[09:03:51] <dcaro> morning
[09:40:26] <blancadesal> o/
[09:49:56] <arturo> o/
[11:05:33] <dcaro> quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/595 <- it's preventing running the tests in lima-kilo
[11:05:36] <dcaro> (without manual changes)
[11:09:35] <dcaro> dhinus: thanks, addressed your comments
[11:10:00] <dhinus> dcaro: thanks, approved
[11:15:19] <dhinus> I'm trying to figure out what's the best answer to this email, or in general to cloud vps users who want to experiment with prometheus metrics
[11:15:23] <dhinus> https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/3SAMOJSJZBH64M3WPQJXXIUACKJPMBJA/
[11:15:53] <dhinus> we have some docs here suggesting it's ok-ish to push some custom metrics to the metricinfra prometheus https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metricsinfra_Prometheus
[11:16:42] <dhinus> but I feel like we need more docs on how to actually do it, or how to set up a custom prometheus instance in a project (if that makes sense)
[11:17:02] <arturo> dhinus: yeah, same feeling here regarding using metricsinfra prometheus
[11:17:20] <arturo> and also same, about additional docs
[11:18:07] <dhinus> I guess the "easy" answer is pointing them to the -cloud IRC channel for help :D
[11:19:01] <dhinus> but maybe I'll try to figure out how this example mentioned in wikitech is working https://libraryupgrader2.wmcloud.org/metrics
[11:20:09] <dhinus> does prometheus.wmcloud.org scrape the /metrics URL on _all_ cloudvps vms?
[11:21:57] <dcaro> I don't think so, both suggestions would be part of the metricsinfra project service (unfinished), allowing users to define alerts/metrics etc.
[11:22:11] <dcaro> we would probably want to do some thinking on the offering we want to give
[11:25:43] <dcaro> the config for the scrapes is in the prometheusconfig DB, that the metricsinfra controller uses
[11:25:49] <dcaro> (the scrapes table)
[11:26:38] <dcaro> I don't see that one though, looking
[11:27:00] <dcaro> found it
[11:28:45] <dcaro> https://www.irccloud.com/pastebin/A0T0GaLY/
[11:32:08] <dhinus> nice one, thanks
[11:37:16] <dhinus> this is the epic task about allowing project admins to configure custom scrape targets: T284993
[11:37:17] <stashbot> T284993: Enable self-service Prometheus configuration management for project administrators - https://phabricator.wikimedia.org/T284993
[11:49:40] <dcaro> yep, I think that's the one yes
[11:49:45] <dcaro> quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/62
[11:49:54] <dhinus> I've added some info to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metricsinfra_Prometheus
[11:50:40] <dcaro> thanks
[11:50:52] <dcaro> (for the docs)
[11:50:59] <dhinus> dcaro: approved the MR
[11:51:19] <dcaro> and the review :)
[11:54:47] <dcaro> oh, another quick one https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/596
[11:54:55] <dcaro> (so when I deploy the fix it's actually tested for)
[12:00:58] <dcaro> gtg for lunch, will deploy the fix https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/63 after (anyone feel free to release before that)
[12:01:12] <dhinus> +1d
[14:49:42] <dcaro> I got a few quick reviews here https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests
[14:52:43] <dcaro> and https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests?label_name%5B%5D=Needs+review
[15:06:07] <dhinus> what is ntp-04 in project cloudinfra? "No Puppet resources found on instance ntp-04 on project cloudinfra"
[15:07:16] <andrewbogott> dhinus: as far as I know all our VMs use ntp-03/04 time servers to sync clocks.
[15:07:22] <andrewbogott> So that probably matters
[15:07:52] <dhinus> looking
[15:08:09] <dhinus> "Failed to open TCP connection to puppetmaster.cloudinfra.wmflabs.org:8140 (getaddrinfo: Temporary failure in name resolution)"
[15:08:55] <andrewbogott> I'd check resolv.conf for starters, and then reboot it :)
[15:09:03] <andrewbogott> btw, confirmed that those servers still matter:
[15:09:07] <andrewbogott> [Time]
[15:09:07] <andrewbogott> Servers=ntp-03.cloudinfra.eqiad1.wikimedia.cloud ntp-04.cloudinfra.eqiad1.wikimedia.cloud
[15:09:16] <andrewbogott> (from a randomly selected VM)
[15:09:42] <dhinus> name resolution is broken for any host
[15:10:11] <dhinus> the nameserver is missing from /etc/resolv.conf
[15:10:30] <dhinus> ntp-03 has "nameserver 172.20.255.1", in ntp-04 that line is missing
[15:10:39] <dhinus> I'll try adding manually, then re-running puppet
[15:10:46] <andrewbogott> that's interesting, has puppet been broken there for a year?
[15:10:54] <andrewbogott> Easy to fix, but mysterious!
[15:11:20] <dhinus> only broken for 960 minutes, apparently
[15:11:22] <dhinus> :)
[15:11:52] <andrewbogott> I would like to think that resolv.conf doesn't just randomly degrade :(
[15:12:11] <dhinus> that line was removed by puppet itself, it's logged
[15:12:23] <dhinus> "Applying configuration version '(4ac6bd9d7f) Eevans - Update corto puppetization'"
[15:13:19] <dhinus> well the previous puppet run was at the same commit, and it worked
[15:13:31] <dhinus> then it somehow decided that line had to go...
[15:13:46] <andrewbogott> so probably a temporary hiera lookup failure...
[15:15:35] <dhinus> looks likely
[15:19:25] <andrewbogott> I'm going to try to make a safety net for this since it seems very bad
[15:19:38] <andrewbogott> (although to be honest puppet should fail entirely if there's a hiera failure...)
[15:20:16] <dhinus> the template is shared with prod so I'm surprised this hasn't happened before
[15:23:17] <dhinus> I'm not finding any related task in phab, I'll open one for posterity
[15:25:56] <andrewbogott> oh great, tell me the # and I'll attach this patch
[15:28:34] <dhinus> T379927
[15:28:35] <stashbot> T379927: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927
[15:28:50] <dhinus> I marked it as "resolved", but feel free to reopen and attach the patch
[15:31:10] <andrewbogott> heh, as always I want to cc the person who worked on this code last and of course it's jbond all the way down
[15:31:32] <andrewbogott> https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249
[15:32:16] <dhinus> haha
[15:32:21] * andrewbogott seeks pre-meeting breakfast
[15:38:02] <jbond> andrewbogott: i don't think that will fix the issue. It's not obvious from the task, but is it possible that the nameservers variable contained ip addresses for the current host?
[15:38:08] <jbond> i ask as the template has the following
[15:38:17] <jbond> `<% @_nameservers.reject{|ns| [@facts['ip'], @facts['ip6']].include?(ns) }.each do |nameserver| -%>`
[15:42:01] <jbond> originally added in: https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d
[15:43:39] <andrewbogott> jbond, you are kibo for the new millennium
[15:44:19] <andrewbogott> So we could move that logic that removes localhost up into the .pp file and then check...
[16:14:29] <jbond> andrewbogott: it won't remove localhost. it will remove the primary IP address
[16:15:33] <jbond> I'd also suggest chatting to _joe_ about why they added the check initially. however if the server is a dns server then i think it also makes sense to use localhost and *not* $facts['ip']
[16:16:47] <jbond> i also worked with sukhe to add some special handling for the production DNS/ntp servers so it's also worth having a chat with them to see what we did there (assuming we finished it)
[16:17:30] * jbond doesn't get the kibo reference
[16:31:55] <_joe_> lol jbond you're too young
[16:32:10] <_joe_> kibo was a fading legend when I joined newsgroups
[16:32:37] <_joe_> he was some guy that ran some bot looking for mentions of him in any newsgroup, and he'd show up if mentioned
[16:32:54] <_joe_> what was the task?
[16:33:18] <jbond> ahh i see, nice to know i'm still too young for some things lol
[16:33:27] <jbond> this was the commit https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d
[16:33:42] <jbond> specifically "Remove all the $nameservers_override from the node definitions and add those to per-site, per-role hiera"
[16:34:05] <jbond> and this line https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d#diff-bb184c1bf60b3bafdb7cd2a60fe65b836f647fe25a3bf5227d26f48f1ff0e38bR9
[16:34:53] <jbond> the line has since changed but is mostly the same https://github.com/wikimedia/operations-puppet/blob/production/modules/resolvconf/templates/resolv.conf.erb#L10
[16:35:47] <jbond> sorry wrong comment, i meant this one: "exclude the IP of the current node from the list to avoid self-dependencies"
[16:36:05] <_joe_> yes so
[16:36:12] <_joe_> this was originally done with the overrides
[16:36:21] <_joe_> we didn't want a node being installed
[16:36:39] <_joe_> having itself as a nameserver, when the nameserver software was still unconfigured
[16:36:48] <_joe_> it causes all sorts of issues ofc
[16:37:05] <_joe_> so unless you modify /etc/resolv.conf as the absolute last thing in puppet
[16:37:23] <_joe_> or at least after the local dns server is set up
[16:37:30] <_joe_> things might fail randomly to resolve
[16:37:34] <_joe_> does this make sense?
[16:37:41] <jbond> ack yes, that makes complete sense
[16:39:09] <jbond> in that case andrewbogott i would speak with sukhe, i was helping them solve this when they were rebuilding the dns servers
[16:39:36] <jbond> i can't remember if we finished it off but whatever we did there should work for you
[16:39:58] <jbond> and i think if you just remove the line you will hit the issue _joe_ describes above when rebuilding servers
[16:40:07] <jbond> thanks _joe_ :D
[16:43:36] * arturo offline
[16:49:46] <andrewbogott> _joe_: the larger context is that we had a VM (not a nameserver) randomly remove its own resolver and I'm trying to prevent that from happening again with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249. Since it only happened once the safest approach may be to just do nothing as that patch was already a stab in the dark -- no idea why its nameserver list came up empty.
[16:50:26] <_joe_> yeah not sure either, we can put a consistency check in there
[16:54:51] <jbond> andrewbogott: from the task it looked like the server was an ntp server. in production the NTP servers are the DNS servers. so are you sure it wasn't a nameserver?
[16:55:49] <andrewbogott> yeah, they're different boxes in wmcs. Let me make sure there's not a name server installed there by accident...
[16:56:50] <andrewbogott> doesn't look like it.
[16:58:09] <jbond> ok one fix would be to move the filter that rejects $facts['ip'] out of the template and into puppet. then you can do your patch. i'll comment on the CR
[16:59:13] <andrewbogott> thanks!
[17:02:22] * jbond done
[17:33:25] <_joe_> andrewbogott: please ping me for a review before merging
[17:35:00] <andrewbogott> will do
[17:36:18] <_joe_> FTR, ofc there's a wikipedia page
[17:36:19] <_joe_> https://en.wikipedia.org/wiki/James_%22Kibo%22_Parry
[17:36:54] <jbond> :) thanks
[17:50:54] <andrewbogott> _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249, not at all time critical
[18:13:51] <dcaro> jbond: nice seeing you around :)
[18:14:00] * dcaro off
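
Note on the resolv.conf thread above: a minimal Ruby sketch of the idea discussed, which is to filter the host's own addresses out of the nameserver list in code rather than in the template, and then refuse to continue if nothing is left. This is an illustration only, not the content of the Gerrit change or of the operations/puppet module. The helper name `resolver_list` and the host address 172.16.0.10 are made up for the example; 172.20.255.1 is the resolver quoted in the log.

```ruby
# Hypothetical helper, not the actual operations/puppet code.
# It applies the same reject step as the template line quoted in the log,
# then raises instead of returning an empty nameserver list.
def resolver_list(nameservers, own_addresses)
  # Drop nil entries so a host without an IPv6 address is handled cleanly.
  own = own_addresses.compact
  remaining = nameservers.reject { |ns| own.include?(ns) }
  if remaining.empty?
    raise ArgumentError, 'no nameservers left after filtering; refusing to write resolv.conf'
  end
  remaining
end

# 172.20.255.1 comes from the log; the host addresses are invented.
puts resolver_list(['172.20.255.1'], ['172.16.0.10', nil]).inspect
```

With a check like this, an empty or fully filtered list would fail the run loudly rather than silently producing a resolv.conf with no nameserver line, which is the symptom seen on ntp-04. Whether that is the right place for the check is the open question in the review.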