[10:43:57] mutante, klausman, effie, hnowlan: gentle reminder to not use cumin1001 for cumin/cookbooks/homer/etc. but cumin1002/cumin2002 instead (see moritzm 's email from before the break for context and the MOTD)
[10:45:33] I am still logged in on 1001 arent I
[10:45:43] I am running from 1002 for sure
[10:46:20] lol volans you had me there for a bit
[10:47:55] effie: I check execution logs, not who's logged in where :D
[10:50:21] yeah, muscle memory drags me to 1001 sometimes :)
[11:01:13] hi there, SRE's infra window is starting now but I still have train steps to run, if you need me to stop please let me know
[11:02:01] Heads up, I'm about to deploy a change to Presto, to switch it to use the PKI certificates. https://gerrit.wikimedia.org/r/c/operations/puppet/+/709713 - Presto services may be bumpy for a bit.
[11:06:51] <_joe_> jnuche: sorry I saw you just now
[11:06:54] <_joe_> please go ahead
[11:07:20] _joe_: thx
[11:18:10] Presto changes are all deployed. Services should be back to normal now.
[13:31:51] jelto: nicely done re: gerrit 991000
[13:34:21] godog: thanks! I'm not a fan of migrating EOL services but this made the migration of the remaining services easier. And removing the misweb design-style-guide service from wikikube again is also quite easy.
[13:35:29] jelto: oh I was talking about the change number, +1 on the content/purpose though too
[13:35:56] ah ok :D
[18:29:09] legoktm re:email list creation | sorry, I just followed the docs on wikitech. It's not working, so I'll delete the list and try again w the Google groups approach as recommended by j-oe
[18:31:29] volans: I copied my .bash_history from cumin1001 to cumin1002 to incentive myself to never use the old one again now, ACK
[19:54:50] volans: I have a present for you: the reimage cookbook needs locking or something for downtiming the host, see the run from https://phabricator.wikimedia.org/T351074#9463635
[19:55:21] I was reimaging ~5 hosts at the same time
[20:05:43] kamila_: if you're referring to the line:
[20:05:51] _Unable to downtime the new host on Icinga/Alertmanager, the sre.hosts.downtime cookbook returned 99_
[20:06:03] volans: yep
[20:06:20] that's most likely because the puppet run on the alert host failed within the timeout passed to run-puppet-agent
[20:06:25] because of the concurrent runs
[20:06:27] oh
[20:06:29] I see
[20:06:57] on the alert hosts puppet is super slow, more than a minute IIRC
[20:09:40] in that case the reimage is just calling the downtime cookbook, that currently doesn't have yet opt-in in using the locking arguments
[20:09:51] right
[20:11:34] we could add the lock only when --force-puppet is set for example
[20:11:39] and make it exclusive per-host
[20:12:11] in all the other cases there is probably no need to limit the concurrency
[20:14:12] if you want feel free to open a task (to not forget) and/or send a patch :)
[20:16:00] right, yes, that'd probably work
[20:20:52] filed: https://phabricator.wikimedia.org/T355187
[20:21:29] I'm not sure I understand the problem well enough to send a patch, but it might be a valuable exercise :D
[20:23:20] thanks for the task kamila_, if you want tomorrow we can pair to do it if you're interested
[20:23:31] (or any other day)
[20:24:11] sure, why not, thanks :)
[20:24:38] thank you for noticing and reporting it ;)
[20:26:34] I'm reimaging way too many things and feel awkward about the alert spam, so it's quite selfish :D
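
(editor's note) The idea floated at 20:11:34 to 20:12:11, taking a lock only when --force-puppet is set and making it exclusive per host, might look roughly like the sketch below. This is not the sre.hosts.downtime cookbook's code and does not use its real locking arguments: every name here (`maybe_host_lock`, `downtime`, `run_puppet`, `set_downtime`) is hypothetical, the lock is a plain in-process `threading.Lock` rather than whatever shared lock the cookbooks use, and keying the lock on the host where Puppet runs is one reading of "per-host".

```python
import contextlib
import threading
from collections import defaultdict

# Hypothetical in-process registry: one lock per host where Puppet gets forced.
# A real cookbook would need a lock shared across processes; this is only a sketch.
_host_locks = defaultdict(threading.Lock)
_registry_guard = threading.Lock()


@contextlib.contextmanager
def maybe_host_lock(puppet_host, force_puppet):
    """Hold an exclusive lock for puppet_host only when force_puppet is set."""
    if not force_puppet:
        # No forced Puppet run, so concurrency is left unlimited.
        yield False
        return
    with _registry_guard:
        lock = _host_locks[puppet_host]
    with lock:
        yield True


def downtime(target_host, puppet_host, force_puppet, run_puppet, set_downtime):
    """Downtime target_host, serializing forced Puppet runs on puppet_host.

    run_puppet and set_downtime are caller-supplied callables (assumptions
    standing in for the real Puppet run and the real downtime call).
    Returns True if the lock was taken, False otherwise.
    """
    with maybe_host_lock(puppet_host, force_puppet) as locked:
        if force_puppet:
            run_puppet(puppet_host)
        set_downtime(target_host)
        return locked
```

With this shape, several concurrent reimages that all force Puppet on the same alert host would queue behind each other instead of racing the timeout, while runs without the flag stay fully concurrent.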