[09:09:37] <_joe_> I have a question about the admin module: say I have a system user I want to be part of the docker group
[09:10:14] <_joe_> so I declare it with the 'user' puppet resource
[09:11:05] <_joe_> now if I want to get it in the docker group where I already have users managed by the admin module, what's the correct way to do that?
[09:11:13] <_joe_> just add the user to the group in data.yaml?
[09:12:09] <_joe_> oh I just need to use system_members, reading the code now
[14:30:01] mutante: > Similar to modern Welsh cythrwfl (“uproar, trouble, agitation”) Welsh :)
[14:30:43] Thanks for this. I've been a Welsh learner for almost 15 years and didn't know this word. Love it.
[14:37:31] jayme: maybe this one!
[14:37:41] yes
[14:38:12] there's a lot of different objections to a lot of different parts of this running through my head, so it's hard to even make sense of what to focus on :)
[14:38:49] but let's start with stepping out of some potential XY-problem scenarios in this conversation, and get at the rationale for the functionality.
[14:39:14] what is the purpose of depooling miscweb in a core site while you're not depooling the rest of the ingress on that site?
[14:40:43] (it's kind of a leading question, and what it's leading to is that core site redundancy is meant for site-level issues, not other lesser issues)
[14:41:17] we sometimes abuse it for maintenance sorts of things, but IMHO that's a pattern we should be finding ways to avoid
[14:41:30] clusters should be upgradable and maintainable without a whole-core-site outage, basically.
[14:42:09] hmmm that was confusing, let me restate that last line
[14:42:14] hmm.. it is mostly for maintenance and keeping feature parity with how it currently is
[14:42:14] clusters should be upgradable and maintainable without a whole-core-site outage of that cluster, basically.
[14:43:25] <_joe_> bblack: I radically disagree with the part about not using depooling for upgrades
[14:43:54] <_joe_> why should we make our life more complicated, the procedures more risky, and more importantly add engineering complexity for little gain?
[14:43:58] I'm not 100% sure about the definition of "clusters" in that sentence tbh
[14:44:27] <_joe_> I assumed "kubernetes clusters" was intended
[14:44:41] well, any cluster from the LVS pov was more what I meant, but this would be one.
[14:45:55] _joe_: I agree that you can make the pragmatic argument that if you're doing a very quick and reliable maintenance or upgrade, rarely, then the odds of that process interfering with a legitimate site-level outage are low and not worth worrying about.
[14:46:04] clusters from LVS pov is still "bunch of k8s nodes" I guess. So what is in lvs.conftool.cluster basically
[14:46:25] the k8s level of detail doesn't really come into this, it just makes it more-complicated to talk about.
[14:46:26] <_joe_> yeah LVS is just an incident in that context
[14:47:35] but it's a slippery slope from there, to effectively not having site-level redundancy
[14:48:19] <_joe_> I disagree, and also I hate to break it to you, but we do not have site-level redundancy right now. Ask the DBAs if we can do a mediawiki switchover *right now*
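(Circling back to _joe_'s admin-module question at the top of the log: the answer was system_members. Below is a minimal, hypothetical sketch of what that could look like in the admin module's data.yaml; the docker group is real per the discussion, but the gid, user names, and exact key layout are illustrative assumptions, not copied from production.)

```yaml
# Hypothetical excerpt of the admin module's data.yaml (layout assumed, not verified).
# "members" lists human users managed by the admin module; a system user that is
# declared separately with a plain 'user' puppet resource goes in "system_members".
groups:
  docker:
    gid: 30001                # placeholder gid, for illustration only
    description: users allowed to run docker commands
    members:
      - some-human-user       # hypothetical existing admin-managed users
    system_members:
      - my-system-user        # the system user created via the 'user' resource
```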
[14:48:26] <_joe_> like, in this second
[14:48:47] <_joe_> it would need to stop some maintenance we can afford to run in codfw because it's not serving traffic
[14:49:05] <_joe_> but if we focus back on the original issue
[14:49:29] <_joe_> I can see arguments towards tying individual services to their ingress for simplicity of reasoning
[14:50:00] <_joe_> but from the operational point of view, we might, under specific circumstances, want to serve service-under-ingress-a from a single dc
[14:50:16] <_joe_> and tying all services under ingress together in discovery feels "wrong"
[14:50:26] <_joe_> OTOH this has nothing to do with maintenance
[14:50:38] right
[14:50:43] <_joe_> so I'm not sure why the following discussion originated from there :)
[14:51:24] <_joe_> one example could be a service that is running very hot and we decide to give it additional capacity in a dc during a specific external event
[14:51:34] if we followed my argument to its logical extreme, clearly we never would've had all the disc records separated in the first place. Clearly we have existing needs and practices around depooling them separately, even during actual switchover tests.
[14:51:37] if I may interject for a second, from the operational point of view, can we try to get the Icinga config fixed first and then approach the more general problem of what's the ideal scenario in this case?
[14:51:39] <_joe_> thus reducing the size of the remaining pools
[14:52:22] <_joe_> volans: sorry, am I involved in the breakage?
[14:52:32] I was
[14:52:40] <_joe_> go fix icinga then :P
[14:52:41] volans: we're kind of in uncharted territory I think. We either have to try to roll back, or try something novel, basically
[14:52:47] volans: I can just remove the monitoring for now
[14:52:52] or that
[14:53:10] <_joe_> I would do that for now
[14:53:42] +1
[14:53:57] <_joe_> jayme: I think at the moment our best bet however is to keep a single discovery record for several reasons
[14:54:44] <_joe_> I didn't reason about the need to lockstep records to the availability of the ingress one.
[14:54:52] _joe_: I don't disagree that our site-level redundancy isn't great now, but IMHO we make it better by aiming to do better in the future.
[14:55:05] rather than reinforcing the patterns that keep us that way
[14:55:34] honestly, the expense and complexity of having a second site at all isn't worth it if it's just to ease upgrades and such.
[14:55:38] <_joe_> bblack: ok, and my point is that if we depool an entire kubernetes cluster while upgrading it, the risk of a full-size outage is smaller and has shorter duration.
[14:56:37] <_joe_> bblack: I am saying that if we want to have actual redundancy for mediawiki, we need to first start to be a/a and understand how / if that slows us down
[14:56:49] <_joe_> we never have a kubernetes cluster down for more than 1-2 hours
[14:56:57] <_joe_> and that happens every what, 6 months at best?
[14:57:07] <_joe_> seems like a down budget that's reasonable to me
[14:57:38] <_joe_> my judgement is that the alternative approach has a much higher chance of causing an outage
[14:59:05] <_joe_> In the context of the fact we have multiple other maintenance processes that are simpler, safer, faster if we depool a site (databases, network, hardware)
[14:59:16] <_joe_> this is really a drop in the ocean
[14:59:22] I'm thinking more from the perspective of all of these lvs/disc services. there's quite a few of them.
if we're not pushing against (ab)using site-level redundancy for routine operations, there will eventually always be something not ready for a site-level outage.
[14:59:43] <_joe_> but we don't actually do anything like that?
[14:59:54] <_joe_> no one depools a service to deploy it
[15:00:26] yeah, and while we might do that to upgrade k8s itself, no service running inside k8s works that way either
[15:00:29] restbase used to do it all the time
[15:00:58] <_joe_> bblack: I doubt they could, given how scap3 works
[15:01:00] if it's not common other than the special case of k8s, then maybe we could declare k8s the exception and have some standards about the rest?
[15:01:08] <_joe_> anyways, meeting
[15:03:16] anyways, what I'm saying wasn't controversial at some point in the past. We did talk about this at offsites ages ago, about the general philosophy of aiming to only use site-level redundancy for site-level issues.
[15:03:26] clearly, it is now controversial! :)
[15:06:00] _joe_: "no one depools a service to deploy it" is exactly the standard I'm arguing for. Emergencies, I get.
[15:06:20] from the LVS pov, the whole k8s ingress is a "service", but it's a meta-service over many others.
[15:07:04] seems like we're in vehement agreement, or at least close to it?
[15:07:06] I think this is pretty sensible overall. we do depool a whole site to upgrade the routers, for instance
[15:07:07] the layering there is really what turns the existing models on their heads. It would (will?) be cleaner if/when k8s gets out from under lvs entirely.
[15:08:03] but there's a lot of steps there I guess, to replicate all the same functionality
[15:08:57] anyways, this is all really far off-track from the initial issue
[15:09:53] service::catalog is serving multiple purposes that are usually-aligned (an LVS service with a discovery service behind it)
[15:10:23] now we're basically saying we want two things to be separated for monitoring+dnsdisc, but together at the LVS level (same IP:port)
[15:10:42] it's a new pattern, we'll have to see if the configuration of it works as expected! :)
[15:11:54] it wasn't clear to me this was the case at the outset, because the original patch still had them on two different ports
[15:12:46] that was totally my fault! And I do agree that this must have made things pretty confusing
[15:13:32] jayme: with the port fixed, the monitoring could theoretically work
[15:13:35] and yes, things should get cleaner when it gets away from under lvs, but we need steps to get there
[15:13:51] but what was the hangup about the icinga host entry?
[15:14:18] unfortunately not, as the problem really is that monitoring.pp does not create the icinga host if there is no lvs stanza
[15:14:26] got it
[15:14:37] so that's probably fixable
[15:16:07] yes, absolutely. But as miscweb.discovery.w is not really used currently anyways, we can figure out a path forward and then decide what and where to fix
[15:17:27] yeah, there's some tricky issues there
[15:18:16] there's icinga-level monitoring, and also lvs-level healthchecks
[15:18:50] but the latter should not be enabled when the lvs stanza is missing, right?
[15:19:11] yeah, the question is whether we care. I think not?
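(For context on the stanzas being discussed: each entry in service::catalog bundles the LVS, Icinga monitoring, and DNS discovery configuration for one service. The sketch below is a rough, hand-assembled example; the key names, IPs, and values are assumptions from memory rather than a copy of the production hieradata, so treat it purely as an illustration of which stanza drives which machinery.)

```yaml
# Illustrative (not production) service::catalog entry for something like miscweb.
# Rough mapping of stanzas to machinery, per the discussion:
#   lvs:        -> pybal/LVS config; monitoring.pp also keys the icinga *host* off this
#   monitoring: -> icinga service checks (being superseded by probes:)
#   probes:     -> prometheus-based blackbox checks
#   discovery:  -> the <name>.discovery.wmnet dnsdisc record
miscweb:
  description: miscweb microsites, miscweb.discovery.wmnet
  encryption: true
  ip:
    eqiad: 10.2.2.52          # placeholder service IPs
    codfw: 10.2.1.52
  port: 30443                 # same IP:port as the ingress in the "together at LVS" model
  lvs:
    enabled: true
    class: low-traffic
    conftool:
      cluster: kubernetes
      service: kubesvc
    depool_threshold: .5
    monitors:
      IdleConnection:         # the only pybal-level check in play here (see 15:20 below)
        timeout-clean-reconnect: 3
        max-delay: 300
  monitoring:
    check_command: check_https_url!miscweb.discovery.wmnet!/
    sites:
      eqiad:
        hostname: miscweb.svc.eqiad.wmnet
      codfw:
        hostname: miscweb.svc.codfw.wmnet
  probes:
    - type: http
  discovery:
    - dnsdisc: miscweb
      active_active: true
```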
[15:20:02] right now for "k8s-ingress-wikikube", the pybal-level healthchecks only do IdleConnection checks
[15:20:28] which is all the one for miscweb did either, so no change
[15:20:53] I was thinking there might have been http-level checks on them, but no
[15:21:27] does dnsdisc rely on service::catalog now as well, or is that still relatively-independent?
[15:22:23] no, we don't care for the health-check case. Ingress will take care of that k8s-internally. So it's fine to just rely on IdleConnection for "k8s-ingress-wikikube"
[15:23:12] right
[15:23:44] re: dnsdisc, that I'm not sure. I always was under the impression that at least the discovery: stanza is needed in service::catalog
[15:24:27] yeah it is, I just finally tracked through all the puppet code
[15:25:05] otherwise I was going to argue that we could just have an array of many monitoring entries under the single lvs-level service
[15:25:29] IMHO we should document a bit better all the magic/indirection that comes down from service::catalog, it's hard to reason about it and generates quite some confusion
[15:25:38] agreed!
[15:25:58] I thought we actually aim to have all "services" service::catalog - with their own entry
[15:26:10] *in service::catalog
[15:26:15] the problem is the vague definition of "service" keeps evolving! :)
[15:26:23] agreed...
[15:27:04] but yeah, there's a lot of metadata in s::c, and it's not always clear which parts drive which functional bits where
[15:27:05] I liked your wording of ingress being a meta-service :)
[15:27:28] in some sense, "text-https" is also a meta-service this way
[15:27:40] yeah, right.
[15:27:56] AIUI monitoring: is going to go away anyways (replaced by probes:)
[15:28:42] but in the equivalent text-https case, we've never tried to have the functionality of e.g. "depool just wikidata.org from eqiad", and also never tried to monitor all the separate hostnames/backend services at the icinga level through s::c
[15:29:12] (which would be quite complicated anyways in that particular case!)
[15:30:56] if it were just the monitoring part at issue, I'd say either of (a) add multiple monitoring entries to a single LVS service, or even better (b) just define k8s sub-service icinga monitoring elsewhere outside of the overloaded service::catalog stuff. You'll want it separate anyways as you migrate away from LVS eventually.
[15:31:48] but if you're doing dnsdisc per-service as well, that's probably more difficult to get away from having an s::c entry for each subservice
[15:32:27] with the icinga monitoring being replaced by prometheus, that dependency on the "lvs" key is going away automatically. Those checks are happy now (with the port fixed)
[15:32:49] ok
[15:33:40] going back to the whys: the reason service::catalog grew into what it is today was that people were (rightly!) unhappy with how complex it was to provision new services. There were many redundant data commits in many places to define the same "service" at various layers and levels.
[15:34:08] so I'm hesitant to just back out of that model and go back to having dnsdisc driven separately, either.
[15:34:30] at least, not now.
maybe it will be different after k8s takes over many of the existing entries and makes it a much smaller list of top-level services
[15:38:07] <_joe_> yeah, no
[15:38:15] <_joe_> and the problem is deeper than that
[15:38:27] <_joe_> dnsdisc now assumes every record is independent
[15:38:39] <_joe_> having dependent records as is is a huge footgun
[15:39:03] <_joe_> we could try to depool ingress and actually still send traffic there
[15:39:10] right
[15:39:43] I mentioned that at the outset, but then I got lost in the whole redundancy argument anyways :)
[15:39:47] <_joe_> so if we want to have independent records, we'll need to rethink that a bit
[15:40:02] <_joe_> so for now, we have no choice really
[15:40:03] or we could add some kind of dependency mechanism to dnsdisc
[15:40:14] <_joe_> yes, but not today :)
[15:40:38] we don't have to find a "final" solution today really
[15:42:29] but it's absolutely right that we need to prevent someone from depooling ingress without knowing :/
[15:44:26] <_joe_> jayme: so I think you need to actually make miscweb.discovery.wmnet a CNAME to the ingress discovery record now
[15:46:19] if we follow that model going forward, then we really could just tack all the sub-service probes under one lvs entry
[15:46:36] but we won't I guess, because we want sub-services with independent dnsdisc failover
[15:46:52] so yeah, I guess it's a temp fixup
[15:51:24] I could do that. But as it is not used currently, I could also leave it be until we come to a conclusion on how to move forward
[15:52:51] as there are more things already in line to go under ingress - I guess we'll have to come up with something anyways
[16:04:47] o/ just tried to use profile::docker::engine on a debian bullseye host in deployment-prep
[16:04:59] E: Unable to locate package docker-engine
[16:05:07] has the package name changed in bullseye?
[16:05:17] yeah
[16:05:22] docker.io?
[16:05:30] https://packages.debian.org/bullseye/docker.io
[16:05:30] ?
[16:05:39] ottomata: yes
[16:05:41] yes
[16:06:11] and that change happened at bullseye, ya? i'll make a patch
[16:06:27] I think so?
[16:06:28] oh!
[16:06:32] profile::docker::engine::packagename
[16:06:37] already has a setting! :)
[16:07:00] yeah it seems to be used lots of places
[16:07:04] ah perfect okay this has already been done, thank you!
[16:07:08] just gotta set it right in horizon
[16:08:04] bblack: _joe_: I'll leave the miscweb.discovery.wmnet record untouched for now and will try to summarize all the above into a phab task tomorrow - if that's fine with you
[16:11:16] sounds good to me!
[16:11:35] although, the CNAME solution is an option too
[16:11:56] either way we're in a temporarily-unideal state, but at least the CNAME one forces the dependency
[16:12:17] if it's not currently being relied on, leaving it alone is simpler though :)
[16:20:49] ack. We're currently using k8s-ingress-wikikube.discovery.wmnet to route traffic there https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/common/profile/trafficserver/backend.yaml#224
[16:55:17] linking this here for visibility: https://phabricator.wikimedia.org/T304089#7819217 (TL;DR - we're going to attempt esams->drmrs failover tomorrow and keep an eye on it for ~24h).
[16:56:15] * volans takes tomorrow off, just in case :-P
[16:56:16] esams will be depooled with the "normal" mechanism with an admin_state commit in ops/dns, and that commit can be reverted to end early if anything goes amok.
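(A small footnote on the docker-engine error above: profile::docker::engine already exposes the package name as a setting, so the fix on a bullseye deployment-prep host is just a hiera override, set per-project or per-host in Horizon. A minimal sketch, assuming the parameter takes a plain package-name string:)

```yaml
# Hypothetical hiera override (e.g. in Horizon for a deployment-prep bullseye host):
# bullseye ships Debian's docker.io package instead of the old docker-engine.
profile::docker::engine::packagename: docker.io
```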
[16:56:31] :P
[16:58:08] sweet
[16:58:14] there are some other things that could go wrong besides transit saturation, it's always hard to predict all the things
[16:58:34] one would be an actual routing problem (e.g. some sizeable ISP's customers start complaining they can't reach us at all)
[16:59:05] didn't realize we were already pushing ~3Gb/s out of drmrs
[16:59:27] yeah we turned on PT, ES, and FR to get more traffic + cache fill last week
[17:02:57] luckily, we have NEL to look at for the potential routing problem, hopefully that will tell us faster than user compaints
[17:03:11] *complaints
[17:26:31] ooh exciting
[18:36:01] volans: lol, I'm luckily off tomorrow also :P