[11:54:07] hi folks
[11:54:46] our backlog of topics for our SRE Monday meeting(s) is currently empty
[11:54:57] our next meeting is this coming Friday, June 27th
[11:55:03] er, coming Monday, obviously
[11:59:55] if you are taking requests six months into the future, please sign me up :)
[12:00:32] topic: Encrypted Client Hello
[12:01:00] I have a WMF backups overview presentation ready (different from the bacula one), but I'm putting myself at the back of the queue in case there is something more current
[12:02:04] please put those in the queue in the doc, alongside suggested dates
[12:02:13] ok
[12:03:31] I can do like last time: keep it around and fill in in case no one else is available
[13:44:35] I need to find some time away from urgent things to write a dgit talk...
[14:56:23] Sanity check: we don't have any way to get vm.max_map_count from our current metrics, correct?
[14:56:51] (prometheus metrics)
[15:06:10] brett: correct, AFAICS that isn't in the set of metrics exported by node-exporter
[15:06:37] I checked with curl localhost:9100/metrics | grep -i vm FWIW
[15:08:11] godog: Thanks for checking. I was just dearly hoping that it was exported under a different name that I wasn't able to locate :(
[15:09:13] brett: ikr? I got my hopes up reading https://github.com/prometheus/procfs/pull/176/files but alas no sign of that metric
[15:09:22] in node-exporter metrics, that is
[15:19:24] brett: just curious, why would that metric be helpful?
[15:20:22] jhathaway: For migrating a varnishd-mmap-count monitor over from icinga to prometheus. Icinga gets the value passed in via puppet, whereas AFAICT we don't have that luxury in the alerts repo
[15:20:32] https://phabricator.wikimedia.org/T300723#8006759
[15:21:36] *alertmanager, not prometheus
[15:28:48] brett: makes sense; has the existing alert ever fired since the value was raised?
[15:33:34] I have to use 'sudo -g' to preserve group permissions on a file and it's prompting me for a password. Would that be my wikitech pw? Or is pw even an option for prod?
[15:36:23] no
[15:37:08] inflatador: I think it is prompting you for root's password
[15:37:35] nobody else has their password set in /etc/shadow
[15:37:53] nbd, I can work around it, I just have to remember to reset perms before I restart elasticsearch
[15:38:23] it doesn't use a systemd unit?
[15:40:59] it does... but we're running a very custom setup. Are you saying that systemd can reset permissions on a file every time it starts/restarts a service?
[15:41:28] there is a preexec option, I believe
[15:42:03] Cool, I'll take a look. That would keep elastic from blowing up every time we upgrade
[15:43:44] inflatador: ExecStartPre (man systemd.service) is one option
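A minimal sketch of that ExecStartPre suggestion as a systemd drop-in; the unit name, keystore path, group, and mode here are assumptions, not the actual setup (which the chat notes is very custom). The "+" prefix makes the command run with full privileges even if the service itself runs as another user:

  # /etc/systemd/system/elasticsearch.service.d/keystore-perms.conf (hypothetical drop-in)
  [Service]
  # reset ownership and permissions on the keystore before every (re)start
  ExecStartPre=+/bin/chgrp elasticsearch /etc/elasticsearch/elasticsearch.keystore
  ExecStartPre=+/bin/chmod 0660 /etc/elasticsearch/elasticsearch.keystore

After adding the drop-in, systemctl daemon-reload makes it take effect on the next restart.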
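Back on the vm.max_map_count gap above: node-exporter's textfile collector can export arbitrary metrics from *.prom files in the directory passed via --collector.textfile.directory. A rough sketch of a script to run from cron or a timer, with the metric name and directory invented for illustration:

  #!/bin/bash
  # export vm.max_map_count as a custom node-exporter metric (hypothetical name/path)
  dir=/var/lib/prometheus/node.d    # assumed textfile-collector directory
  echo "node_vm_max_map_count $(sysctl -n vm.max_map_count)" > "${dir}/vm_max_map_count.prom.$$"
  # rename within the same filesystem so the collector never reads a partial file
  mv "${dir}/vm_max_map_count.prom.$$" "${dir}/vm_max_map_count.prom"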
[16:49:13] jhathaway: errr, not sure how to view historical alert fires in icinga (or at least find the needle in the haystack of alerts) :(
[16:50:32] brett: no big deal, I was just wondering whether there was value in the alert, or whether after bumping the value it was very unlikely to occur, or if it did there would be other problems such as running out of memory
[16:58:50] you can search in the icinga dashboard on logstash: https://logstash.wikimedia.org/app/dashboards#/view/AWm67Kpk8aQffZ3HmRpW?_g=h@865c245&_a=h@772ca2e
[16:59:46] or IRC logs too :D
[17:28:21] brett: can you identify the alert in the icinga web UI? If you do, there is "View Alert History For This Service" in the upper left corner of the frame
[17:49:13] that covers just the last syslog (~24h)
[18:13:18] looking at 'prometheus::blackbox::check::http', there is a $team parameter and the default is 'sre'. Would you know where to find the list of other valid team strings? Or does that first have to be created?
[18:19:23] yeah, I don't see it anywhere else
[18:20:52] ACK, asking via email
[18:25:11] mutante: fwiw I see these other values in use on alertmanager, for alerts that are currently active: https://usercontent.irccloud-cdn.com/file/0yt382Pd/image.png
[18:25:23] not sure offhand where to find a complete list of the options though
[18:25:47] oh there it is, modules/alertmanager/templates/alertmanager.yml.erb
[18:26:28] ah!
[18:26:31] rzl: :) thank you
[18:26:54] it looks like what I really want is to write a custom "receiver" first
[18:27:10] I want task + email + IRC + specific subteam
[18:27:14] (found that file just by grepping for a handful of those team names all in one place)
[18:27:15] but not pages
[18:28:15] I should add that in https://gerrit.wikimedia.org/r/c/operations/puppet/+/807176/1/modules/prometheus/manifests/blackbox/check/http.pp
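A hypothetical sketch of such a receiver and route in alertmanager.yml (the real template being the .erb file above); every name and URL here is invented. The route matches on the team label and deliberately carries no paging integration:

  receivers:
    - name: 'subteam-example'
      email_configs:
        - to: 'subteam@example.invalid'      # email delivery
      webhook_configs:
        - url: 'http://localhost:9123/irc'   # assumed IRC relay endpoint
        - url: 'http://localhost:8292/task'  # assumed task-creation webhook
  route:
    routes:
      - match:
          team: subteam-example
        receiver: 'subteam-example'          # no pager on this route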
[20:23:31] does anyone know if it's possible to use cumin interactively, a la clusterSSH?
[20:52:54] inflatador: not that I am aware of, but v.olans would be able to answer authoritatively
[20:53:18] how many machines do you want to control?
[20:53:34] ~36 hosts
[20:53:45] all the elasticsearch hosts in a single DC, that is
[20:55:06] ah, that is quite a few; you could use tmux synchronized panes, but your panes would be very small
[20:56:29] maybe you want an "exec" in puppet instead? it can use an "onlyif" condition like "only if a certain file does not exist" or whatever. this can often fix those "needs to run only once but on all hosts" setups
[20:57:36] AFAIK I have to respond interactively, I guess I could try something with expect maybe? It's elasticsearch's keystore, which is a flat file on every single instance. Has to be the same everywhere for the plugin to work
[20:59:19] you can script tmux
[20:59:42] I see. I mean.. there is still the oldschool way. Make a list of hosts and "for host in ... ; ssh .. -C; done"
[21:00:23] yeah, I may end up doing that
[21:00:26] tbh I'd rather do that before trying expect
[21:00:29] looks like this guy has a good suggestion: https://www.aldnav.com/blog/adding-s3-keys-to-elasticsearch-keystore/
[21:02:00] ah, echo piped into the keystore command. that looks right. should remove the interactive part
[21:02:06] then you can go back to cumin I guess
[21:02:16] didn't realize it could accept input from stdin, that makes it easier
[21:02:34] yeah
[21:03:09] can you put the secret bits in private puppet?
[21:03:22] then pipe the secret bits in with puppet?
[21:03:49] I guess cergen already does that
[21:03:57] Haven't thought that far ahead yet ;P
[21:04:07] :)
[21:04:46] but then cergen is supposed to be replaced with 'cfssl pki', which is profile::pki::get_cert
[21:04:49] heh
[21:05:18] Not that it's a solution here, but I have been playing around with hashicorp vault lately, pretty nice for PKI
[21:07:02] puppet$ grep -r pki::get_cert *
[21:07:16] this shows other services using the most modern way to get certs for wmf services, afaict
[21:07:25] there are like 2 or 3 older ways
[21:07:37] does it work for arbitrary secrets? These aren't certs
[21:08:58] oh, I don't know. I assumed too much just because the keystore command was involved
[21:09:45] puppet generally does support arbitrary secrets though, via ./modules/secret/secrets/ in the private repo
[21:09:48] no worries, I bet it's doable with private puppet
[21:09:52] yea
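For reference, the non-interactive pattern from that blog post combined with the oldschool loop mentioned above; the host list, setting name, and secret value are placeholders:

  # elasticsearch-keystore's "add" subcommand reads the value from stdin when
  # given --stdin, so nothing prompts and the loop can run unattended
  secret='REPLACE_ME'                       # placeholder secret value
  for host in $(cat elastic-hosts.txt); do  # hypothetical host list, one per line
    ssh "$host" "echo -n '$secret' | sudo /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin s3.client.default.access_key"
  done

Once the prompt is gone, the same one-liner should also work fanned out via cumin instead of the ssh loop.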
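And a rough Puppet sketch tying together the earlier exec/onlyif idea with a secret from the private repo; the secret path and setting name are invented for illustration:

  # hypothetical; secret() resolves from ./modules/secret/secrets/ in the private repo
  $s3_key = secret('elasticsearch/s3_access_key')

  exec { 'elasticsearch-keystore-s3-key':
      command => "/bin/echo -n '${s3_key}' | /usr/share/elasticsearch/bin/elasticsearch-keystore add --stdin s3.client.default.access_key",
      # "unless" is the inverse of the "onlyif" idea above: run only when the key is missing
      unless  => '/usr/share/elasticsearch/bin/elasticsearch-keystore list | /bin/grep -q s3.client.default.access_key',
  }

One caveat with both sketches: the secret briefly appears on a command line, hence in the process table, which may or may not be acceptable here.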