[09:14:07] 🥳
[16:03:55] first thought from meeting just now: sounds like it might be tough to put new zuul in the aux k8s cluster.
[16:04:11] (the executor/scheduler bits, anyway)
[16:08:39] From the description at https://wikitech.wikimedia.org/wiki/Kubernetes/Clusters#aux it already seemed unlikely to me. Combining the risks of CI payloads with "SRE supported critical infrastructure services" does not generally seem the way we have done things around here.
[16:18:01] if not the aux cluster, then that would leave: plan for a new cluster, and then there's a decision about the interim state. Or an interim/long-term host. One other thought, sparked by jelto's question about how to containerize execution: I'm uncertain about how we would use a k8s cluster for workers very effectively.
[16:19:54] dduvall: or hashar: any thoughts about workers from that call? (though I know a lot was focused on the executor)
[16:22:20] I'd argue the CI control plane is critical infra
[16:23:17] at least I have a better mental model of the scheduler > executor > workers
[16:23:40] so the workers can be scaled to whatever, and I think the dumbest way is to mimic what we do currently
[16:23:56] use static nodes in Nodepool (similar to how we statically define Jenkins agents in the Jenkins controller)
[16:24:16] re: the description of how the executor dispatches jobs, i don't think we should rule out k8s for workloads completely, but from the description of what privileges the executor needs itself, i think we can rule out k8s for the moment
[16:24:35] have those static nodes be WMCS instances that have Docker running and whatever basic config (that is how I have set it up on zuul-dev)
[16:25:16] yeah, that's something we didn't bring up in the meeting, the fact that all of our jobs are currently run via docker
[16:25:59] I talked about that with James some years ago and we can use the Ansible playbook that runs test_command
[16:26:03] and pass it "docker run whatever"
[16:26:11] for a first step
that might be sufficient
[16:26:23] that's probably sufficient, yeah
[16:26:51] another possibility would be to run the current CI images in k8s via nodepool
[16:27:40] so that when Zuul requires a resource for a label mediawiki-php81, nodepool will spin up a container from the releng/quibble-bullseye-php81 image
[16:27:56] and once the container is ready ansible will be able to run `quibble --whatever`
[16:28:07] but we don't have a kubernetes cluster to run those ci images
[16:28:08] anyway
[16:28:29] the big Q is where we run the Executors :/
[16:29:54] or how we run it
[16:30:01] but it seems using the upstream containers is the easiest
[16:30:22] and if I got it right nobody bothers with the Helm chart (why would you really?)
[16:31:00] and the operator had some traction but is probably barely used
[16:31:59] so that leaves docker compose or, well, simply `docker run` (potentially using systemd to meet what our monitoring system is expecting and to make it consistent with how we do things)
[16:35:26] no stupid questions: docker in systemd: do we have a way to scale the number of instances? Or would we need docker-compose?
[16:35:49] docker compose in prod seems weird to me
[16:36:12] ditto
[16:36:20] I can already hear _joe_ mumbling from the end of the building
[16:36:21] :b
[16:36:53] we already run docker based things via systemd (like buildkitd) so i think we should stick to that if we're not going to use k8s
[16:37:04] there are puppet manifests for it
[16:37:19] cool
[16:37:40] scaling, well, if not using k8s it requires provisioning additional hosts. this is all sre collab's domain though :)
[16:38:13] got it, so no benefit to scaling on a single host?
[16:38:28] only by adding additional hosts when running docker
[16:38:38] not that i can think of.
it wouldn't help with load or failover
[16:39:19] ease of upgrades is definitely a joint releng/collab concern, however
[16:39:20] gotcha
[16:40:08] so we should definitely consider what upgrades look like when using upstream vs vendored vs wmf-built images
[16:40:24] for systemd it would be bumping a hiera value somewhere vs. k8s, where it's a helm deploy (which anyone could do)
[16:40:41] maybe...
[16:40:44] (by vendored images, i mean upstream images imported into our registry)
[16:41:19] do we have any of those now?
[16:42:51] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/gitlab_runner.yaml#84 is where we bump the buildkitd image ref
[16:43:35] hmm, i'm not sure if we have images that are directly imported, but i recall doing `FROM some.example/upstream/image@{digest}` in the past :D
[16:43:40] I'll catch up tomorrow, I have to head to yoga!
[16:43:49] hashar: enjoy!
[16:48:38] * corvus waves
[16:54:39] hi corvus!
[16:57:36] here's how opendev runs, for example, the zuul-executor via docker-compose: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/zuul-executor/files/docker-compose.yaml
[16:57:41] all the other components are similarly represented. and there's an overall "zuul" role that sets up some users, permissions, etc
[16:57:47] we used docker-compose just to have some isolation from the underlying os (we started doing this when we would have had to choose between upstart and systemd!)
[16:57:51] but structurally, starting the container with systemd should be very similar
[16:59:20] here's a k8s manifest deployment of zuul: https://gerrit.googlesource.com/zuul/ops/+/refs/heads/master/k8s/zuul.yaml
[17:00:27] that has some google-cloud specific stuff in it, but, in general, that file is very similar to the docker-compose files: just "run this container image" with a few mountpoints.
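[editor's note: to make the docker-compose approach corvus describes above concrete, here is a rough sketch of what such a file for the executor could look like. It is not copied from the linked opendev file; the image tag and mount paths are illustrative assumptions, so check the linked source for the real configuration.]

```yaml
# Illustrative sketch only -- see the linked opendev docker-compose.yaml
# for the actual configuration. Image tag and paths are assumptions.
services:
  executor:
    image: quay.io/zuul-ci/zuul-executor:latest
    # host networking and elevated privileges, since the executor
    # manages its own per-job sandboxes (bubblewrap)
    network_mode: host
    privileged: true
    volumes:
      - /etc/zuul:/etc/zuul
      - /var/lib/zuul:/var/lib/zuul
      - /var/log/zuul:/var/log/zuul
    restart: always
```

As corvus notes, this is essentially "run this container image with a few mountpoints", which is why translating it to a systemd-managed `docker run` is structurally similar.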
[17:01:14] here are the helm charts: https://opendev.org/zuul/zuul-helm
[17:01:29] they may be a good starting point, but they may have bitrotted
[17:01:50] thanks corvus and welcome!
[17:04:28] thanks!
[17:05:07] ^ bd808 if you want to do any voiced magic for corvus, he's here \o/
[17:05:21] i like to reference opendev a lot because all the ops are fully public
[17:05:44] here's a zuul gate job that runs a complete copy of the zuul system on ephemeral nodes to check all our deployment code: https://zuul.opendev.org/t/openstack/build/2089be9926d24f3b8fd240a5577433a7
[17:07:10] (it runs on 8 ephemeral virtual machines, and spins up an executor, launcher, merger, scheduler, database, zookeeper, and load balancer -- all using those docker-compose files)
[17:13:31] very cool, that's a nice template to follow. Is there a good reason to keep these services on separate boxes? The current zuul setup has the scheduler, merger, and jenkins running on a single beefy host (we have a few schedulers to keep up with demand). Seems like that would be akin to executor, launcher, merger, and scheduler on one machine.
[17:26:47] !issync
[17:26:47] Syncing #wikimedia-zuul (requested by bd808)
[17:26:49] Set /cs flags #wikimedia-zuul dduvall +AVfiortv
[17:26:51] Set /cs flags #wikimedia-zuul marxarelli -AVfiortv
[17:26:53] Set /cs flags #wikimedia-zuul corvus +Vv
[17:28:15] why does the bot keep wheel warring on dan's accounts?
[17:28:19] * bd808 looks at config
[17:29:17] maybe because my nick is not my username?
[17:34:00] dduvall: yeah, that is my guess too. The config is using your nick and I think it should be using your account instead. Same with the role I just added for c.orvus. Fix inbound.
[17:35:16] thcipriani: it's not necessary to have them on separate boxes. opendev does it because we scale out the vms for capacity purposes (while also running at least 2 of everything for availability). but all-in-one, or any combination, is fine.
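[editor's note: the "docker run under systemd" option discussed earlier (16:31:59, 16:36:53) could look roughly like the sketch below. The unit name, image tag, and mounts are illustrative assumptions, not an existing WMF manifest; in practice this would be templated out by puppet, as with buildkitd.]

```ini
# /etc/systemd/system/zuul-executor.service -- illustrative sketch only
[Unit]
Description=Zuul executor (docker)
After=docker.service
Requires=docker.service

[Service]
# remove any stale container left over from an unclean stop
ExecStartPre=-/usr/bin/docker rm -f zuul-executor
ExecStart=/usr/bin/docker run --name zuul-executor --rm \
    --network host --privileged \
    -v /etc/zuul:/etc/zuul \
    -v /var/lib/zuul:/var/lib/zuul \
    quay.io/zuul-ci/zuul-executor:latest
ExecStop=/usr/bin/docker stop zuul-executor
Restart=always

[Install]
WantedBy=multi-user.target
```

Under this model, an upgrade is bumping the image ref in hiera and restarting the unit, matching the gitlab_runner.yaml example linked above.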
[17:36:07] (there is also a small amount of right-sizing, in that opendev runs smaller zuul-merger vms than others because they max out using fewer resources)
[17:36:28] s/than others/than the other services/
[17:39:13] ack, thanks for confirming, our current setup has two of our beefy boxen for redundancy. I'm sure collab folks will have other considerations to think about for splitting services, but good to know our existing split is still a valid one.
[17:40:50] !issync
[17:40:50] Syncing #wikimedia-zuul (requested by bd808)
[17:40:52] Set /cs flags #wikimedia-zuul marxarelli +AVfiortv
[17:41:03] !issync
[17:41:03] Syncing #wikimedia-zuul (requested by bd808)
[17:41:04] No updates for #wikimedia-zuul
[17:41:15] wheel war stopped I think :)
[17:41:20] :(
[17:41:22] :)
[17:41:27] haha
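[editor's note: the static-node approach suggested earlier (16:23:56, 16:24:35 -- WMCS instances with Docker, registered as static Nodepool nodes) would be expressed with Nodepool's static driver, roughly as below. Hostname, username, and label are hypothetical placeholders for whatever the WMCS instances and job labels end up being.]

```yaml
# Illustrative Nodepool static-driver sketch; names are placeholders.
labels:
  - name: mediawiki-php81

providers:
  - name: wmcs-static
    driver: static
    pools:
      - name: main
        nodes:
          - name: ci-worker-01.example.wmcloud.org  # hypothetical WMCS instance
            labels:
              - mediawiki-php81
            username: zuul
```

Zuul then requests a node with the `mediawiki-php81` label, and the executor's Ansible runs the job (e.g. `docker run ...` or `quibble --whatever`) on that instance over ssh.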