[05:30:43] inflatador: I think vitess is _a lot_ more than just a HA solution. I have been reading the last few days you were talking about pacemaker...I think those two are completely different solutions. Vitess will pretty much make you also change the way you handle your data and not just provide HA. Just my opinion. :)
[05:47:10] Krinkle: https://phabricator.wikimedia.org/T298769#7604096
[06:29:59] <_joe_> I think that for databases you need to consider at least three dimensions when deciding if you need to do autofailover:
[06:30:30] <_joe_> 1 - What is the gain in terms of uptime for a single crash if we have autofailover vs automated-but-manual failover
[06:30:46] <_joe_> 2 - What is my MTBF that would be fixed by autofailover
[06:31:31] <_joe_> 3 - What is the probability autofailover is caused by excessive traffic, resulting in a thundering herd situation
[06:32:27] <_joe_> if you don't have constant failures and autofailover indeed shortens your downtime significantly, the complexity and risk of second-order failures it introduces isn't worth the risk
[06:33:00] <_joe_> so if we were managing hundreds of database shards with a relatively high failure rate, I would be the first to say we need autofailover
[06:33:28] <_joe_> but we're not, and the MTBF of a db master is pretty high (I would say > 1 year, but marostegui correct me if I'm wrong)
[06:33:50] <_joe_> and in the case of analytics, I'm not even sure what our SLO is
[06:34:19] I don't even want to say this but...I don't even remember when was the last time a master failed for us, definitely more than a year ago
[06:34:23] <_joe_> so, I've used corosync/pacemaker for databases but *disabling* autofailover in the past
[06:34:27] <_joe_> marostegui: so today
[06:34:28] <_joe_> ok
[06:34:30] yes
[06:34:30] <_joe_> cool
[06:34:31] XD
[06:35:41] <_joe_> but that was in places where we had no automation framework nor an established system of database proxies for manual-but-automated failover of the master
[06:36:22] <_joe_> In general, I think it's healthy to choose the simplest solution that allows you to respect your SLO, if you have one (which we should)
[09:54:31] marostegui, jynus: can I have some reviews on https://gerrit.wikimedia.org/r/c/operations/puppet/+/751726 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/751725 ? (just making sure it's on your radar, no rush xd)
[10:10:23] dcaro: sure, will check
[10:28:12] dcaro: lgtm'd both
[10:28:30] thanks kormat!
[10:28:50] dcaro: in general feel free to send data-persistence puppet cleanups to me
[10:29:02] kormat: sure! thanks!
[10:29:58] jynus: i saw your comment re: wmcs on the second one, but arturo from wmcs already +1'd it (and was the one who deprecated their use of it), so it should be fine
[10:30:49] but my comments wasn't about cloud team usage, but vps usage
[10:31:55] ah ok. i believe the process used to generate the list of unused classes took that into account, too
[10:32:13] T272559 says "compared list of unused modules with the one in the cloud and removed the ones that are used there"
[10:32:13] T272559: Unused puppet resources audit, 2021 - https://phabricator.wikimedia.org/T272559
[10:33:23] jynus: yep, saw it, we are not using that role specifically, but a 'subrole' kinda thing
[11:58:43] marostegui: how long would the upgrade take and how soon could it start?
[11:59:11] Eg in terms of a chunk of days/weeks to move around
[12:02:21] Krinkle: I can probably finish the upgrade in like 2 days
[12:02:32] Krinkle: And I could start next week (we are still testing a few bits here and there)
[12:21:42] marostegui: okay, I'll reply on task, probably fine to just do anytime then within January. We add the schema soon but that only takes a second and can happen anytime. We can postpone traffic until after the upgrade, currently finishing up other Q2 things.
[12:24:33] Krinkle: that sounds good to me.
thanks, let's follow up on the task then
[23:39:23] opcache still corrupting method calls, I'm noticing a pattern that very particular areas of MW (or vendor) code seem to attract these corruptions. e.g. calling of Monolog/isHandling() on API hosts, and on Parsoid hosts the loading of wmf-config/src/PhpAutoPrepend etc.
[23:39:30] latest one at https://phabricator.wikimedia.org/T297898
[23:59:43] and another at https://phabricator.wikimedia.org/T245183