[06:38:28] <_joe_> jhathaway: https://en.wiktionary.org/wiki/it%27s_better_to_ask_forgiveness_than_permission
[06:38:29] <_joe_> :)
[10:34:24] _joe_: Makes me laugh, change control at several former workplaces of mine does not agree :P
[10:34:56] <_joe_> topranks: you worked at ITIL places
[10:35:05] <_joe_> or "we do blameful postmortems"
[10:35:51] <_joe_> at $JOB~1 network was managed with ITIL
[10:35:57] <_joe_> with the change advisory board and all
[10:36:09] Not exactly ITIL but several telcos/ISPs, so very regimented and strict policy.
[10:36:33] <_joe_> I had to write a 3-page essay, attend a change advisory board meeting, and interact with people on the most horrible ticketing system
[10:36:34] I sometimes went rogue and got myself in trouble. Also sometimes went rogue and saved the company :)
[10:36:40] <_joe_> then wait a week
[10:37:21] <_joe_> and on the day of the change you'd discover that something went wrong anyway because the CAB didn't properly coordinate the different changes
[10:37:39] Yeah I know that pain. Has to be done sometimes. Proper CI and peer review really is the way to reduce the overhead tbh.
[10:37:51] Oh that used to drive me mad.
[10:38:39] <_joe_> the thing above was incredibly O(1) in effort, whatever the change might be
[10:38:44] You approved my change at this time..... but the data team are rebooting the DB as well (or whatever other change that'd make it impossible to assess the status of my thing).
[10:38:48] <_joe_> opening a single port to one IP
[10:39:07] <_joe_> or redesigning all of our load-balancing
[10:40:24] <_joe_> I mean I get why banks move slow
[10:40:43] <_joe_> or people designing avionics systems or healthcare monitors etc etc
[10:40:48] uh huh. Telcos too, I used to rail against it but I can see why they are conservative.
[10:40:58] <_joe_> up to a point, sure
[10:41:01] But still they can improve the process I think.
[10:41:08] <_joe_> yeah
[10:41:39] One of the biggest problems I always found was that technical people had to persuade a non-technical audience of the relative risks of something.
[10:42:03] And if it went wrong you can't say "well you assessed it and allowed it too" cos they just say "yeah but we don't understand that"
[10:42:13] not that anyone technical wants to spend their life on the review board
[10:47:45] <_joe_> oh we had "technical" people on the board
[10:48:06] <_joe_> former developers turned bureaucrats
[10:48:37] <_joe_> their opinion on network changes was almost as relevant as mine on CSS changes :P
[10:57:20] my $JOB[-1] was also quite keen on ITIL, complete with CAB, Jira, and so on. Was never really convinced it added anything but paperwork to the process; and we had to do separate comms (and downtime negotiation) with our users anyway
[11:42:34] Dream or nightmare? Building a local network webbed across the front of the building. https://medium.com/@pv.safronov/moscow-state-university-network-built-by-students-211539855cf9
[11:45:45] hahaha, that's awesome
[11:47:54] Amazing that it lasted until 2013 as well :-)
[11:52:12] Krinkle: wow
[11:58:25] Ciao Daimona
[12:09:45] given the dhcp config rewriting magic in sre.hosts.reimage, is it safe to run multiple reimages in parallel?
[12:22:47] Ah man that's classic.
[12:22:53] "Having two competing networks was a cherry on the top of it. They were cutting each other’s cables, executing DDoS attacks, stealing equipment, etc."
[12:22:55] lol
[12:25:39] I'd say broadcast storms were fun to troubleshoot :D
[12:29:15] hnowlan: My guess is the answer is yes it's safe, but I think we may need volans to verify.
[12:37:18] yeah, I believe the same, and in addition we've certainly had phases where >= 3 people were reimaging at the same time, so even if there's a race window it would be small (or we were just lucky :-)
[13:29:11] hnowlan: TL;DR yes it's safe. The most racy step (to be improved) is the downtime on Icinga, which requires a puppet run on the icinga host that takes ages and might time out if too many are piled up. My suggestion is to open different tmux/screen sessions and start the next reimage 60~100 seconds after the last.
[13:30:24] Specifically for the DHCP magic, each cookbook run will manage only a single file with the snippet for the host being reimaged. So there is no conflict there, but I can't guarantee 100% that it is race-free if there are multiple runs within the same second.
[13:46:48] "I cannot reimage db1011 more than 20 times per second, my workflow is broken, please fix" ;-)
[13:59:32] that article is amazeballs! thanks for posting, Krin kle
[14:48:51] volans|off: nice, thank you!
[14:48:55] the cookbook itself works great
[15:21:57] I'm doing some tests with the puppet compiler jenkins workers, if anyone runs any job it might be a bit unstable, should not take long (let me know if you need to run pcc urgently)
[15:41:48] re: reimages - I did do several in parallel a few weeks ago to test. There was eventually a race at the icinga step as discussed above. Staggering a bit would probably save some pain :)
[15:42:26] well not a "race" in any real sense, but a timeout issue
[15:50:18] <_joe_> we should introduce deduplication of puppet runs there :P
[16:06:11] there's a new version of puppet compiler deployed in one of the jenkins workers, it's working as expected, but there might be some things we missed, will leave it for a while, so if you see any weird things going on with puppet-compiler, please ping me or jbond to give it a look, thanks!
[16:09:24] the new compiler is named pcc-worker1001 (the old ones are compiler100[12])
[16:34:45] i have merged your changes
[17:19:09] do you think wasabi.com / https://en.wikipedia.org/wiki/Wasabi_Technologies is a viable alternative to Amazon S3? their entire marketing is that it's "80% lower cost than S3" so I was wondering who is behind it. looks like they just do storage and nothing else
[17:48:39] if someone could review https://gerrit.wikimedia.org/r/c/operations/puppet/+/745920/, that would be appreciated
[17:57:03] jhathaway: did you run the sre.hosts.decommission cookbook on it already?
[17:57:29] no, it sounded like I needed to remove it from puppet first?
[17:58:33] " - Check if any reference was left in the Puppet (both public and private) or
[17:58:36] mediawiki-config repositories and ask for confirmation before proceeding
[17:58:38] if there is any match.
[17:58:40] " [17:58:51] hm from https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_to_Spares_OR_Decommission I thought the cookbook happens first [18:00:07] hmm [18:00:38] well, I'll run the cookbook and see what it says [18:03:43] okay I checked a couple of recent decom tickets to be sure, and cookbook-first is correct [18:03:57] the host shouldn't still be *used* for anything, which I suspect is what that puppet reference check is for [18:05:24] you wouldn't want the hostname listed as the destination address for any traffic for example -- but it should still be in site.pp until after, I think the cookbook wouldn't be able to operate on it otherwise [18:06:12] rzl: thanks [18:28:29] jhathaway: rzl: it's a dilemma. if you remove the host from _everything_ including site.pp then the decom cookbook will fail because it won't find the host in puppet. but if you don't remove it first then the decom cookbook will warn you "omg, it's still in the repo". So the real answer is to remove it from everything (DHCP etc) but not site.pp, then cookbook, then site.pp.. or just say "yea, I [18:28:35] know what I'm doing" to the cookbook.. but then you might not notice it is still in some _other_ place [18:28:49] nod [18:28:57] but also this is a special case because "rename of existing host" so.. eh.. not sure :) [18:28:59] I guess the cookbook should exclude site.pp from that check then, no? [18:29:09] yes, it should [18:29:36] sounds straightforward enough - then you can take the warning seriously the rest of the time [18:29:44] that's true [18:30:14] (agree the rename case is special anyway and jhathaway might need to do some stuff differently, but I think that's all still true) [18:30:54] I'll open a task and v.olans can explain how it's more complicated than I think :D [18:31:16] the cookbook found the reference in site.pp, but did not fail the task, so perhaps it is already skipped? [18:32:07] oh, it didn't give you the "found matches in puppet, proceed anyway?" prompt? [18:32:08] not fail but stop and make you say "yes, I am aware" at least [18:33:28] rzl: it did! I should have read more carefully [18:34:00] ahh yeah, that's all :D [18:34:21] if that had matched in most other places in puppet, it would be a really good reason to stop what you're doing [18:34:53] nod [18:35:04] which is why the proposed change -- if we take out the expected matches, then that warning can be phrased with a lot more urgency [18:36:27] (= font color changes to red :) [18:37:15] little USB robot arms to wave at you [18:37:34] please type in the number of regex matches to continue [18:37:34] etc etc [18:38:19] lol https://what.thedailywtf.com/uploads/files/1483566204824-img_0213.jpg [18:41:41] looks like kubernetes1022 is being added as part of the decom switch change, expected? [18:42:00] eh, I dont think so [18:42:32] that's from https://phabricator.wikimedia.org/T294301 [18:42:40] work on it happened yesterday, new racking [18:42:49] but regardless I would not expect that in the diff [18:42:58] hmm [18:43:25] mutante: any idea who I might ask about it? [18:43:55] jhathaway: ideally jclark if he is on dcops channel [18:44:18] mutante: thanks [18:44:19] I think what happened is after they add hosts to netbox they should run the dns cookbook