[05:42:06] <_joe_> Perforce is not private equity though
[07:23:56] It is the 2020s: nobody uses Puppet anymore. It peaked circa 2015-2016 as people started moving to Docker/Ansible
[07:35:59] I had a colleague who insisted that the way to go was Ansible, in a Docker container. You can sort of see the idea, everyone can easily have the same environment, but the amount of path gymnastics you have to do to make everything show up just right in the container and on your command line is non-trivial
[07:39:56] "With Puppet, we will be providing our customers with access to a product portfolio that enables them to drive innovation on a global scale." ... I don't know what that means. It sort of sounds like they bought it by accident and now they're looking for a justification
[08:19:12] Emperor: re: https://phabricator.wikimedia.org/T307874 looks like the best case scenario (i.e. "self-healing")
[08:20:12] ah yes, good procrastination us ;-)
[08:28:20] heheh
[08:42:53] _joe_: re Perforce: "In April, private equity firm Francisco Partners acquired 50% of Perforce, becoming an equal partner with Clearlake Capital."
[08:43:06] <_joe_> sigh
[08:43:15] welp.
[08:43:34] <_joe_> my experience with private equity was rather fattening the pig to sell it at an inflated price
[08:43:56] <_joe_> anyways, yes, jbond, we should move to ansible.
[08:44:08] yes, it's not looking positive imo
[08:44:45] <_joe_> hashar: "everyone" not meaning "people running a mid-to-large infra"
[08:45:41] <_joe_> those either use any of the old dogs (puppet/chef/cfengine/etc) or moved to the cloud and mostly use terraform with pre-configured VM images
[08:46:05] <_joe_> ansible's performance when you get over ~100 servers was *dreadful* as of a few years ago
[08:46:15] * jbond even the mention of cfengine brings back nightmares
[08:46:29] <_joe_> jbond: sorry, I should've added a trigger warning
[08:46:47] indeed ;)
[08:47:18] I have dug a bit into the Ansible code base, and well, I can't say I was impressed. It really looked like a lot of spaghetti code :]
[08:47:34] <_joe_> hashar: lol, have you ever seen puppet's ruby
[08:47:40] <_joe_> or worse, puppet's clojure?
[08:47:58] <_joe_> but anyways, puppet had a real chance to be completely dominant
[08:48:07] <_joe_> if they had built abstractions to build docker images
[08:48:19] then indeed folks tend to use pre-built images and use Terraform. It seems the whole industry moved up to yet another level of abstraction
[08:48:26] <_joe_> and manage their updates/etc using a configuration DSL people are comfortable with
[08:48:40] <_joe_> hashar: that is not very different from us using kubernetes
[08:48:42] even for computers: you no longer have to set up hardware or a VM, it is all k8s
[08:49:09] <_joe_> we prebuild containers and run them on a scheduling infra via a declarative language
[08:49:13] yes yes
[08:49:35] <_joe_> there are many advantages to running an immutable infra
[08:49:39] I was merely saying that Puppet is more or less obsolete compared to the way things are done nowadays by most people
[08:49:45] <_joe_> oh yes
[08:49:51] <_joe_> they completely missed the train
[08:49:51] yes, puppetlabs really dropped the ball with containers
[08:50:04] <_joe_> containers and cloud orchestration
[08:50:07] yes
[08:50:24] because you barely have to enforce state (that is frozen by a container image) nor have to set up systems (you use prebuilt images); config is handled by Terraform etc
[08:50:32] <_joe_> they opted to rewrite their server in clojure instead
[08:50:44] <_joe_> this is a good lesson to anyone who wants to rewrite stuff
[08:50:48] <_joe_> like, say, mediawiki
[08:50:57] * hashar cough
[08:51:44] <_joe_> hashar: I mean, ever since "The Mythical Man-Month" it's been universally known that complete rewrites of established systems typically end in tragedy
[08:52:03] that is heavily point-of-view based
[08:52:15] from the product perspective, surely that can be a tragedy
[08:52:41] but for one's career advancement, they can then brag they have experience with Clojure :)
[08:52:46] (now I am trolling)
[08:53:33] anyway, it should be possible to use Puppet to provision a Docker image. I did that back in the day to provision a VM image from scratch (via `puppet apply` and a bunch of hacks)
[08:54:13] so one could certainly have a `RUN puppet apply` to craft the image tarball
[08:55:02] but then it is only using a subset of the Puppet system, and there are simpler tools to achieve that (read: Ansible)
[08:55:15] so yeah, I agree, they kind of missed the containers train
[10:38:11] rzl: https://commons.wikimedia.org/w/index.php?title=File:Wikimedia_Foundation_Quarterly_Report,_FY_2014-15_Q4_(April-June).pdf&page=3
[10:49:42] <_joe_> 99.94 seems *very* low
[10:49:47] <_joe_> how was that calculated?
[10:50:02] <_joe_> ahhh 2014
[10:50:04] <_joe_> lol
[10:50:21] <_joe_> yeah, that seems optimistic then
[10:56:43] https://meta.wikimedia.org/wiki/Wikimedia_monthly_activities_meetings/2015-08
[10:56:53] Also part of monthly metrics
[10:57:55] Would be nice to return to some of that, with what we've learned since then about how and what we measure.
[10:58:26] But serving cache hits vs app servers doesn't seem like a bad high-level category
[11:01:36] Considering it as not "up" during minutes where latency or error ratio is above our target is, I guess, a bit of a dated model. We could do that differently now. E.g. % of requests for the whole month served within X time, and whether % of errors is below a stated threshold. Much like our more internal-facing quarterly check-in slides recently for SLOs.
[11:02:26] what is the context of this conversation?
[11:25:33] <_joe_> not sure :D
[11:25:56] <_joe_> I just saw the link and I was surprised to see such a low uptime number
[11:30:21] 99.94% is roughly five hours of downtime per year.
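[A quick back-of-the-envelope check of that downtime figure, as a minimal sketch; 8760 hours assumes a non-leap year:]

```python
# At 99.94% availability, the allowed downtime is the remaining 0.06% of the year.
hours_per_year = 365 * 24  # 8760 hours, ignoring leap years
availability = 0.9994
downtime_hours = hours_per_year * (1 - availability)
print(round(downtime_hours, 2))  # → 5.26 hours per year
```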
Not that bad
[12:16:52] ooohh, the code snippet beside comments on the gerrit review main page is nice too, just noticed it on a large display/window
[12:17:21] disappears on a small viewport, so YMMV
[14:30:17] we have an empty slot for a session at our next SRE meeting
[14:30:23] that is, this coming Monday
[14:30:25] anyone up for it?
[14:54:34] <_joe_> paravoid: me and cdanis can present something about the work on ddos response
[14:58:47] _joe_: that sounds awesome, thanks!
[15:15:44] !log ganeti4001 updating all firmware revisions
[15:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:51] !log ganeti4001 updating all firmware revisions T307997
[15:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:15:56] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997
[15:35:53] ganeti4001 firmware update is very slow, still in progress.
[15:53:04] !log firmware upgrade for ganeti4001 complete T307997 (bios, nics, idrac); manually confirmed the first 10G port has an active link (it does) and is set to pxe
[15:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:53:09] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997
[16:32:21] cwhite: do you have a few minutes to advise me about logstash/ecs/python things?
[16:33:58] andrewbogott: I do.
[16:34:35] My question is regarding this dashboard: https://logstash.wikimedia.org/app/dashboards#/view/13149940-ff6c-11eb-85b7-9d1831ce7631?_g=h@865c245&_a=h@32447ce which comes from nova_fullstack_test.py
[16:34:41] (hoping that dashboard link actually works)
[16:35:44] That script runs a new test every few minutes, and I'd like to ask logstash to filter based on a particular test. Since the test involves creating a VM, it would be cool to inject the hostname as event.id in each log message.
[16:36:36] Do I need to include that in every single call to LOG in my code, or is there some way to tell the python logger "for the time being, include this key/value in all upcoming log messages"?
[16:37:00] (I hope my problem statement is clear; I'm confident my possible solutions are muddled)
[16:40:28] It may not be the cleanest, but adding to the ECSFormatter class may help here.
[16:41:05] ok, so put the state in that class and then just call a setter when starting a new test?
[16:41:15] I'll give that a whirl
[16:41:23] Yeah, that's where I'd start
[16:41:46] ok! Is event.id an appropriate field to use for this?
[16:42:55] If it's unique enough for your use, I think so: https://doc.wikimedia.org/ecs/#field-event-id
[16:44:03] Consider that event.id is a unique value and should probably not be used to group events.
[16:45:41] ok, event.id isn't right then, I'll read that doc and pick something
[16:46:39] If nothing else fits, an arbitrary label is ok: https://doc.wikimedia.org/ecs/#field-labels
[16:46:59] ok
[16:47:05] thank you! Will add you when I have a patch
[16:54:24] Hm, is there a getFormatter to go with setFormatter...
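[The chat suggests keeping the state on the ECSFormatter subclass. A standard-library alternative for "include this key/value in all upcoming log messages" is a `logging.Filter` that annotates every record; a minimal sketch, where `ContextFilter`, the `test_id` attribute, and the `fullstack` logger name are all hypothetical illustrations, not the actual nova_fullstack_test.py code:]

```python
import logging

class ContextFilter(logging.Filter):
    """Attach a mutable, run-scoped value to every record passing through."""
    def __init__(self):
        super().__init__()
        self.test_id = None  # hypothetical field name, set once per test run

    def filter(self, record):
        # Annotate rather than drop: always return True.
        record.test_id = self.test_id
        return True

logger = logging.getLogger("fullstack")  # hypothetical logger name
ctx = ContextFilter()
logger.addFilter(ctx)

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(test_id)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

ctx.test_id = "vm-20220510-01"  # set this when a new test starts
logger.info("creating VM")      # every message now carries the test id
```

As for the last question: the stdlib `logging.Handler` has no `getFormatter` method; the formatter set via `setFormatter` is readable as the plain attribute `handler.formatter`.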
[19:42:53] hey folks - don't forget to submit the SRE Summit survey if you haven't yet :)
[21:01:10] !log cp305[23] going offline via T243167 for firmware updates (puppet agent disabled and depooled prior to reboot)
[21:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:01:15] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167
[21:33:54] !log cp50[23] returned to service and all green in icinga, cp50[45] depooling for firmware update
[21:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:00] !log cp50[23] returned to service and all green in icinga, cp50[45] depooling for firmware update T243167
[21:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:07] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167
[21:34:27] wow, so many typos.
[21:34:37] !log cp30[23] returned to service and all green in icinga, cp30[45] depooling for firmware update T243167
[21:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:55] !log cp305[45] returned to service and all green in icinga, cp305[67] depooling for firmware update T243167
[22:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:01:01] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167
[22:28:11] !log cp305[67] returned to service and all green in icinga, cp305[89] depooling for firmware update T243167
[22:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:17] T243167: Upgrade BIOS and IDRAC firmware on R440 cp systems - https://phabricator.wikimedia.org/T243167