[07:23:31] mutante: what about `prometheus_nodes`? do you need all the nodes or only the local ones? (and it's defined (empty set) for cloud)
[07:55:28] also the site's prometheus hosts already have access to all ports via standard firewall rules, what problem are you trying to solve mutante ?
[10:16:35] Last reminder that I'll be running the switchover live test at 11:00 UTC
[10:16:42] ack
[10:16:51] * chaosmonkey steps back from the keyboard
[10:16:52] If there are any objections, better voice them now :P
[10:17:14] chaosmonkey: "Put the keyboard down SLOWLY"
[10:17:55] claime: don't shout at me.. I'm chaos after all
[10:18:09] chaosmonkey: It's not me, it's the sre police
[10:18:13] I'm not responsible
[12:21:57] volans: what's a script in SRE cookbook terms?
[12:23:26] volans: SREBatchRunnerBase pre/post_script() should return a list of scripts,
[12:24:09] from sre/misc-clusters/example.txt it seems that's expecting a list of python methods?
[12:24:26] return [self.script_example]
[12:24:37] script_example being def script_example(self, hosts)
[12:25:11] vgutierrez: pre/post_script() are a list of commands to execute on the target hosts, if instead you want to run some python function you have to use pre/post_action
[12:25:36] nope, I wanna run some CMDs before restarting haproxy and some after
[12:25:47] then that's the one
[12:26:41] but I don't know if the example isn't accurate or if I'm too thick
[12:27:02] vgutierrez: see for example
[12:27:03] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/gitlab/reboot-runner.py#40
[12:27:29] volans: right
[12:27:43] yes, the example.txt is outdated apparently (cc jbond )
[12:27:50] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/sre/misc-clusters/example.txt#72
[12:27:53] that's misleading IMHO
[12:27:55] all that part should move to spicerack and have documentation auto-generated
[12:27:58] yes
[12:28:32] thanks for the example <3
[12:29:49] sorry for the confusion
[12:30:46] I might need an action after all... cause I'm assuming that run-puppet-agent isn't kosher in a cookbook :)
[12:33:13] what do you mean?
[12:33:34] ah you need to run puppet
[12:33:43] return ['depool', 'apt-get update', 'apt-get install haproxy', 'run-puppet-agent -q'] seems reasonable?
[12:33:55] or should I trigger a puppet run in a more idiomatic way?
[12:34:09] depool? if you're using a load balanced service use SRELBBatchRunnerBase instead
[12:35:16] noted
[12:35:47] but... I don't think I can use that
[12:36:02] * jbond will create a CR to move these cookbook base classes to spicerack this week
[12:36:38] SRELBBatchRunnerBase performs the depool as part of the action
[12:36:56] but I need it to be done as part of the pre action
[12:36:59] if that makes sense :)
[12:38:18] ah, yeah, you still could by overriding self._restart_daemons or self._reboot, but that's a bit hacky
[12:38:21] and feels wrong
[12:38:58] it might be worth adding an upgrade package action, that's probably common enough that it's worth adding (not helpful now though)
[12:55:11] volans, jbond first naive approach on https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/891267, let's move the discussion there :)
[12:55:48] from a very high level, I see a pattern and a big question: how to "alert" on normal but unintended states? That's very hard!
[14:08:48] volans: thanks for resolving T329773! as per your comment, I wanted to let you know that I will be reimaging a dnsrec host in ~30 mins
[14:08:49] T329773: spicerack dnsdisc.Discovery attempts to query depooled/disabled dns auth servers - https://phabricator.wikimedia.org/T329773
[14:09:35] sukhe: ack, it should all be good, but because today is our no-no day who knows what will happen :D
[14:09:42] haha
[14:09:51] * sukhe postpones reimaging for tomorrow
[14:09:53] :)
[14:10:09] lol, up to you :)
[14:10:44] what can I say, I am a believer in Murphy's Law :P
[14:12:14] ahahaha
[14:16:29] That's a very spicy day
[14:16:31] for sure
[14:19:36] https://www.youtube.com/watch?v=ErgdUhZteqw
[14:22:25] godog: not what I was expecting but I will take it :)
[14:22:56] haha sukhe ! you are welcome
[14:30:50] volans: re kafka stretch, we've been waiting for them to be set up since they've been received (but also not had time to use them yet, so haven't pushed).
[14:33:35] it's just that it sends an email every day to root@, so something is clearly up on that one but not fully puppetized
[16:56:51] XioNoX: yes, I needed prometheus_nodes and not monitoring_servers. that was actually the answer
[16:57:29] godog: I was fixing "connection refused" for prometheus nodes to contint servers, using port 1443
[16:58:06] ci::firewall previously only allowed cp* hosts
[16:58:12] it's fixed for me now
[17:11:24] mutante: you don't need an explicit firewall rule for that because of https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/base/manifests/firewall.pp#70
[21:04:24] could someone merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/891360 for me=
[21:04:26] ?
[22:12:13] zabe: sure, done!
[22:17:46] thanks
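
Editor's note on the 12:21-12:38 exchange: the distinction drawn there is that pre/post scripts are lists of shell command strings executed on the target hosts, while pre/post actions are Python hooks. The sketch below illustrates only that distinction. It is not the real cookbook or spicerack API: `BatchRunnerSketch`, `HaproxyUpgradeSketch`, the injected `run_command` callable, the method names, the `pool` post command, the `systemctl restart haproxy` command, and the host name `lb1` are all assumptions made for illustration. Only the pre-script command list is taken from the log.

```python
from typing import Callable, List


class BatchRunnerSketch:
    """Hypothetical stand-in for a batch runner base class (NOT the real API)."""

    def __init__(self, run_command: Callable[[str, str], None]) -> None:
        # run_command(host, command) is an assumed, injected remote executor.
        self._run_command = run_command

    def pre_scripts(self) -> List[str]:
        """Shell commands (strings) to run on each host before the restart."""
        return []

    def post_scripts(self) -> List[str]:
        """Shell commands (strings) to run on each host after the restart."""
        return []

    def pre_action(self, hosts: List[str]) -> None:
        """Python hook called once before any host is touched."""

    def post_action(self, hosts: List[str]) -> None:
        """Python hook called once after all hosts are done."""

    def restart_command(self) -> str:
        # Placeholder; subclasses name the daemon they restart.
        return "true"

    def run(self, hosts: List[str]) -> None:
        self.pre_action(hosts)
        for host in hosts:
            for command in self.pre_scripts():
                self._run_command(host, command)
            self._run_command(host, self.restart_command())
            for command in self.post_scripts():
                self._run_command(host, command)
        self.post_action(hosts)


class HaproxyUpgradeSketch(BatchRunnerSketch):
    """Returns plain command strings, not Python methods."""

    def pre_scripts(self) -> List[str]:
        # Command list quoted from the 12:33:43 message.
        return ["depool", "apt-get update", "apt-get install haproxy",
                "run-puppet-agent -q"]

    def post_scripts(self) -> List[str]:
        # Assumed counterpart to 'depool'; not stated in the log.
        return ["pool"]

    def restart_command(self) -> str:
        return "systemctl restart haproxy"  # assumed restart command


# Usage: record the commands instead of executing them anywhere.
log: List[str] = []
HaproxyUpgradeSketch(lambda host, cmd: log.append(f"{host}: {cmd}")).run(["lb1"])
print("\n".join(log))
```

Injecting the executor keeps the sketch runnable without any remote hosts; a real cookbook would delegate command execution to its own framework.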