[00:57:30] <JJMC89>	 Rook: for T379746, if you need specifics to secure PAWS, I can provide more info privately.
[00:57:30] <stashbot>	 T379746: Cleanup miners - https://phabricator.wikimedia.org/T379746
[00:58:44] <Rook>	 No, no need for specifics. I was asking about IPs because I thought there was some way to block an IP. We can add the security tag to that ticket if you think it is appropriate
[01:01:24] <JJMC89>	 I can block IPs, which I have done for the proxies/web hosts. The largest group appears to be a normal telecom provider, though, with other legitimate users.
[01:03:14] <JJMC89>	 Reviewing at this rate isn't sustainable though. Something is likely going to need to be done from the PAWS/cloud side.
[09:03:51] <dcaro>	 morning
[09:40:26] <blancadesal>	 o/
[09:49:56] <arturo>	 o/
[11:05:33] <dcaro>	 quick review https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/595 <- it's preventing running the tests in lima-kilo
[11:05:36] <dcaro>	 (without manual changes)
[11:09:35] <dcaro>	 dhinus: thanks, addressed your comments
[11:10:00] <dhinus>	 dcaro: thanks, approved
[11:15:19] <dhinus>	 I'm trying to figure out what's the best answer to this email, or in general to cloud vps users who want to experiment with prometheus metrics
[11:15:23] <dhinus>	 https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thread/3SAMOJSJZBH64M3WPQJXXIUACKJPMBJA/
[11:15:53] <dhinus>	 we have some docs here suggesting it's ok-ish to push some custom metrics to the metricsinfra prometheus https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metricsinfra_Prometheus
[11:16:42] <dhinus>	 but I feel like we need more docs on how to actually do it, or how to set up a custom prometheus instance in a project (if that makes sense)
[11:17:02] <arturo>	 dhinus: yeah, same feeling here regarding using metricsinfra prometheus
[11:17:20] <arturo>	 and also same, about additional docs
[11:18:07] <dhinus>	 I guess the "easy" answer is pointing them to the -cloud IRC channel for help :D
[11:19:01] <dhinus>	 but maybe I'll try to figure out how this example mentioned in wikitech is working https://libraryupgrader2.wmcloud.org/metrics
[11:20:09] <dhinus>	 does prometheus.wmcloud.org scrape the /metrics URL on _all_ cloudvps vms?
[11:21:57] <dcaro>	 I don't think so, both suggestions would be part of the metricsinfra project service (unfinished), allowing users to define alerts/metrics etc.
[11:22:11] <dcaro>	 we would probably want to do some thinking on the offering we want to give
[11:25:43] <dcaro>	 the config for the scrapes is in the prometheusconfig DB, that the metricsinfra controller uses 
[11:25:49] <dcaro>	 (the scrapes table)
[11:26:38] <dcaro>	 I don't see that one though, looking
[11:27:00] <dcaro>	 found it
[11:28:45] <dcaro>	 https://www.irccloud.com/pastebin/A0T0GaLY/
[11:32:08] <dhinus>	 nice one, thanks
[11:37:16] <dhinus>	 this is the epic task about allowing project admins to configure custom scrape targets: T284993
[11:37:17] <stashbot>	 T284993: Enable self-service Prometheus configuration management for project administrators - https://phabricator.wikimedia.org/T284993
[11:49:40] <dcaro>	 yep, I think that's the one yes
[11:49:45] <dcaro>	 quick review: https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/62
[11:49:54] <dhinus>	 I've added some info to https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Metricsinfra_Prometheus
[11:50:40] <dcaro>	 thanks
[11:50:52] <dcaro>	 (for the docs)
[11:50:59] <dhinus>	 dcaro: approved the MR
[11:51:19] <dcaro>	 and the review :)
[11:54:47] <dcaro>	 oh, another quick one https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/596
[11:54:55] <dcaro>	 (so when I deploy the fix it's actually tested for)
[12:00:58] <dcaro>	 gtg for lunch, will deploy the fix https://gitlab.wikimedia.org/repos/cloud/toolforge/tools-webservice/-/merge_requests/63 after (anyone feel free to release before that)
[12:01:12] <dhinus>	 +1d
[14:49:42] <dcaro>	 I got a few quick reviews here https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests
[14:52:43] <dcaro>	 and https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests?label_name%5B%5D=Needs+review
[15:06:07] <dhinus>	 what is ntp-04 in project cloudinfra? "No Puppet resources found on instance ntp-04 on project cloudinfra"
[15:07:16] <andrewbogott>	 dhinus: as far as I know all our VMs use ntp-03/04 time servers to sync clocks.
[15:07:22] <andrewbogott>	 So that probably matters
[15:07:52] <dhinus>	 looking
[15:08:09] <dhinus>	 "Failed to open TCP connection to puppetmaster.cloudinfra.wmflabs.org:8140 (getaddrinfo: Temporary failure in name resolution)"
[15:08:55] <andrewbogott>	 I'd check resolv.conf for starters, and then reboot it :)
[15:09:03] <andrewbogott>	 btw, confirmed that those servers still matter:
[15:09:07] <andrewbogott>	 [Time]
[15:09:07] <andrewbogott>	 Servers=ntp-03.cloudinfra.eqiad1.wikimedia.cloud ntp-04.cloudinfra.eqiad1.wikimedia.cloud
[15:09:16] <andrewbogott>	 (from a randomly selected VM)
[15:09:42] <dhinus>	 name resolution is broken for any host
[15:10:11] <dhinus>	 the nameserver is missing from /etc/resolv.conf
[15:10:30] <dhinus>	 ntp-03 has "nameserver 172.20.255.1", in ntp-04 that line is missing
[15:10:39] <dhinus>	 I'll try adding manually, then re-running puppet
[15:10:46] <andrewbogott>	 that's interesting, has puppet been broken there for a year?
[15:10:54] <andrewbogott>	 Easy to fix, but mysterious!
[15:11:20] <dhinus>	 only broken for 960 minutes, apparently
[15:11:22] <dhinus>	 :)
[15:11:52] <andrewbogott>	 I would like to think that resolv.conf doesn't just randomly degrade :(
[15:12:11] <dhinus>	 that line was removed by puppet itself, it's logged
[15:12:23] <dhinus>	 "Applying configuration version '(4ac6bd9d7f) Eevans - Update corto puppetization'"
[15:13:19] <dhinus>	 well the previous puppet run was at the same commit, and it worked
[15:13:31] <dhinus>	 then it somehow decided that line had to go...
[15:13:46] <andrewbogott>	 so probably a temporary hiera lookup failure...
[15:15:35] <dhinus>	 looks likely
[15:19:25] <andrewbogott>	 I'm going to try to make a safety net for this since it seems very bad
[15:19:38] <andrewbogott>	 (although to be honest puppet should fail entirely if there's a hiera failure...)
[15:20:16] <dhinus>	 the template is shared with prod so I'm surprised this hasn't happened before
[15:23:17] <dhinus>	 I'm not finding any related task in phab, I'll open one for posterity
[15:25:56] <andrewbogott>	 oh great, tell me the # and I'll attach this patch
[15:28:34] <dhinus>	 T379927
[15:28:35] <stashbot>	 T379927: Puppet removed "nameserver" line from /etc/resolv.conf - https://phabricator.wikimedia.org/T379927
[15:28:50] <dhinus>	 I marked it as "resolved", but feel free to reopen and attach the patch
[15:31:10] <andrewbogott>	 heh, as always I want to cc the person who worked on this code last and of course it's jbond all the way down
[15:31:32] <andrewbogott>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249
[15:32:16] <dhinus>	 haha
[15:32:21] * andrewbogott seeks pre-meeting breakfast
[15:38:02] <jbond>	 andrewbogott: i don't think that will fix the issue.  It's not obvious from the task, but is it possible that the nameservers variable contained ip addresses for the current host?
[15:38:08] <jbond>	 i ask as the template has the following 
[15:38:17] <jbond>	  `<% @_nameservers.reject{|ns| [@facts['ip'], @facts['ip6']].include?(ns) }.each do |nameserver| -%>` 
[15:42:01] <jbond>	 originally added in: https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d
[15:43:39] <andrewbogott>	 jbond, you are kibo for the new millennium
[15:44:19] <andrewbogott>	 So we could move that logic that removes localhost up into the .pp file and then check...
[16:14:29] <jbond>	 andrewbogott: it won't remove localhost.  it will remove the primary IP address
[16:15:33] <jbond>	 I'd also suggest chatting to _joe_ about why they added the check initially.  however if the server is a dns server then i think it also makes sense to use localhost and *not* $facts['ip']
[16:16:47] <jbond>	 i also worked with sukhe to add some special handling for the production DNS/ntp servers so it's also worth having a chat with them to see what we did there (assuming we finished it)
[16:17:30] * jbond doesn't get the kibo reference
[16:31:55] <_joe_>	 lol jbond you're too young
[16:32:10] <_joe_>	 kibo was a fading legend when I joined newsgroups
[16:32:37] <_joe_>	 he was some guy that ran some bot looking for mentions of him in any newsgroup, and he'd show up if mentioned
[16:32:54] <_joe_>	 what was the task?
[16:33:18] <jbond>	 ahh i seee nice to know im still too young for some things lol
[16:33:27] <jbond>	 this was the commit https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d
[16:33:42] <jbond>	 specifically "Remove all the $nameservers_override from the node definitions and add those to per-site, per-role hiera"
[16:34:05] <jbond>	 and this line https://github.com/wikimedia/operations-puppet/commit/60f6abd82915524d3db11f59e5aabbce5e42d78d#diff-bb184c1bf60b3bafdb7cd2a60fe65b836f647fe25a3bf5227d26f48f1ff0e38bR9
[16:34:53] <jbond>	 the line has since changed but is mostly the same https://github.com/wikimedia/operations-puppet/blob/production/modules/resolvconf/templates/resolv.conf.erb#L10
[16:35:47] <jbond>	 sorry wrong comment, i meant this one: "exclude the IP of the current node from the list to avoid self-dependencies"
[16:36:05] <_joe_>	 yes so
[16:36:12] <_joe_>	 this was originally done with the overrides
[16:36:21] <_joe_>	 we didn't want a node being installed
[16:36:39] <_joe_>	 having itself as a nameserver, when the nameserver software was still unconfigured
[16:36:48] <_joe_>	 it causes all sorts of issues ofc
[16:37:05] <_joe_>	 so unless you modify /etc/resolv.conf as the absolute last thing in puppet
[16:37:23] <_joe_>	 or at least after the local dns server is set up
[16:37:30] <_joe_>	 things might fail randomly to resolve
[16:37:34] <_joe_>	 does this make sense?
[16:37:41] <jbond>	 ack yes, that makes complete sense
[16:39:09] <jbond>	 in that case andrewbogott i would speak with sukhe, i was helping them solve this when they were rebuilding the dns servers
[16:39:36] <jbond>	 i can't remember if we finished it off but whatever we did there should work for you
[16:39:58] <jbond>	 and i think if you just remove the line you will hit the issue _joe_ describes above when rebuilding servers
[16:40:07] <jbond>	 thanks _joe_ :D
[16:43:36] * arturo offline
[16:49:46] <andrewbogott>	 _joe_: the larger context is that we had a VM (not a nameserver) randomly remove its own resolver and I'm trying to prevent that from happening again with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249.  Since it only happened once the safest approach may be to just do nothing as that patch was already a stab in the dark -- no idea why its nameserver list came up empty.
[16:50:26] <_joe_>	 yeah not sure either, we can put a consistency check in there
[16:54:51] <jbond>	 andrewbogott: from the task it looked like the server was an ntp server.  in production the NTP servers are the DNS servers.  so are you sure it wasn't a nameserver?
[16:55:49] <andrewbogott>	 yeah, they're different boxes in wmcs. Let me make sure there's not a name server installed there by accident...
[16:56:50] <andrewbogott>	 doesn't look like it.
[16:58:09] <jbond>	 ok one fix would be to move the filter that rejects $facts['ip'] out of the template and into puppet.  then you can do your patch.  i'll comment on the CR
[16:59:13] <andrewbogott>	 thanks!
[17:02:22] * jbond done
[17:33:25] <_joe_>	 andrewbogott: please ping me for a review before merging
[17:35:00] <andrewbogott>	 will do
[17:36:18] <_joe_>	 FTR, ofc there's a wikipedia page
[17:36:19] <_joe_>	 https://en.wikipedia.org/wiki/James_%22Kibo%22_Parry
[17:36:54] <jbond>	 :) thanks
[17:50:54] <andrewbogott>	 _joe_: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1091249, not at all time critical
[18:13:51] <dcaro>	 jbond: nice seeing you around :)
[18:14:00] * dcaro off