[15:53:29] There was some discussion about the k8s driver, and using it to get a namespace from nodepool (where the job could then launch its own pods from specific images) instead of getting a running pod (which would act like a VM) from nodepool.
[15:53:33] Here's a link to the documentation: https://zuul-ci.org/docs/nodepool/latest/kubernetes.html and the small example at the top shows how to configure each type of thing. It's as easy as just telling nodepool you want a "namespace". I don't have any examples of jobs that use that though (OpenDev doesn't do anything like that). But it's pretty straightforward to tell Ansible to
[15:53:39] I added a section about "Container Images" to the design doc; that has links to the dockerfile and job definitions, etc.
[15:53:43] perform k8s operations. You would just use the k8s module to, for example, tell it to launch a pod with an image: https://docs.ansible.com/ansible/latest/collections/kubernetes/core/k8s_module.html
[16:00:35] o/
[16:00:36] :)
[16:04:00] a lot of the Jenkins jobs merely do a single `docker run`. So the environment is set up via docker images; the one to test MediaWiki with PHP 8.1 is https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/refs/heads/master/dockerfiles/quibble-bullseye-php81/Dockerfile.template
[16:04:26] there is a parent layer which installs our test system (Quibble), more or less similar to devstack but with far fewer features
[16:05:06] thus my guess is that on the executor we would ask Kubernetes to run that image with some environment variables and args for the container
[16:05:45] dduvall advocated building an experiment
[16:15:01] yep, a zuul job structure that is roughly like this should work: 1) build a new image with the source code change; 2) get a k8s namespace; 3) run the newly-built image as a pod in k8s
[16:17:42] the images are frozen though, we don't build them automatically
[16:17:47] so that saves time on each build
[16:17:54] and ensures we run builds with the same environment
[16:18:09] oh, then you could just use the k8s pod support in nodepool
[16:18:16] probably yeah
[16:18:38] did you get access or a tour of our OpenStack system?
[16:19:05] there is a manually installed Zuul instance at https://zuul-dev.wmcloud.org/
[16:19:33] not sure if access is finalized; no tour yet
[16:21:30] do you have an account at https://idm.wikimedia.org ? That might be simply `corvus`
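(For reference, a minimal sketch of the two nodepool label types discussed above. The provider, context, pool, and label names are placeholders, and the pod image is only a guess based on the quibble-bullseye-php81 Dockerfile linked at 16:04 — not an agreed-upon configuration:)

```yaml
# Sketch of a nodepool.yaml provider section for the Kubernetes driver,
# loosely following the example at the top of the driver documentation.
providers:
  - name: microk8s            # placeholder provider name
    driver: kubernetes
    context: microk8s         # must match a context in the kube config
    pools:
      - name: main
        labels:
          # "namespace" hands the job an empty namespace; the job then
          # launches its own pods (e.g. with kubernetes.core.k8s).
          - name: k8s-namespace
            type: namespace
          # "pod" hands the job a running pod built from a frozen image,
          # which behaves more like a VM-style node.
          - name: quibble-bullseye-php81
            type: pod
            image: docker-registry.wikimedia.org/releng/quibble-bullseye-php81:latest  # guessed image path
```

With a namespace-type label the job does its own k8s operations from a playbook; with a pod-type label the pod itself is the node the job runs on.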
[16:21:37] the email matches at least
[16:22:41] yep that's me, and i agreed to all the things; just not sure if i've been added to the group yet
[16:23:04] part of the issue is we have two realms
[16:23:34] production requires signed paperwork, NDA, service level agreement, etc.
[16:23:53] that would ultimately grant access to VMs managed by Ganeti which are within the production cluster
[16:24:16] the other realm is WMCS (WikiMedia Cloud Services), which is an OpenStack that is also used by tech-savvy volunteers
[16:24:21] (and staff)
[16:24:35] https://zuul-dev.wmcloud.org/ is set up on the latter
[16:26:21] !log Added corvus to zuul3 WMCS project
[16:26:21] hashar: Not expecting to hear !log here
[16:26:29] :D
[16:27:03] corvus: if I did it right you should have access to the zuul3 OpenStack tenant via https://horizon.wikimedia.org/
[16:31:56] hashar: confirmed
[16:32:19] https://horizon.wikimedia.org/project/instances/ should show you three instances
[16:32:24] zuul-1001 has the scheduler/executor
[16:32:35] node1001 is a static node I have created to play with configuring a static node
[16:32:40] and microk8s is one I have created just now
[16:33:19] to access them you'd need to generate an ssh keypair and upload it to idm.wikimedia.org
[16:33:27] that ssh keypair should be dedicated to *.wmcloud.org
[16:34:16] cause to access the instances you must pass through a bastion (`ProxyJump bastion.wmcloud.org:22`), which means that potentially one of our roots could steal your credentials and then connect to wherever you have access
[16:34:19] (iirc)
[16:34:23] the doc should be https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances
[16:35:13] I already made you a member of the zuul3 tenant :)
[16:56:37] dduvall: I have created a microk8s instance [172.16.16.203 / 2a02:ec80:a000:1::84 ]
[16:56:49] microk8s (1.26/stable) v1.26.15 from Canonical installed
[17:08:08] https://phabricator.wikimedia.org/T395826 !:-]
[17:10:37] for the mariadb database request, we will need to have some estimate for queries per second and size, and define if and how often we want backups. So far I have zero idea how we would determine QPS and size. Would you?
[17:26:47] back in the day nodepool used two connections per VM
[17:27:01] so I think I had a max of 50 instances and we had mariadb set at 100
[17:27:21] but corvus would surely know how many simultaneous connections and what rate of queries to expect
[17:27:35] I don't think it will be large, nor a concern for our DBA
[17:27:40] I am off!
[17:27:43] dinner time
[17:27:57] thanks. yea. it's just things that are asked by default .. it's a template to fill out
[19:07:12] first: it's very low traffic so it almost certainly doesn't matter :)
[19:07:19] the actual numbers are going to vary a lot depending on the system size and workload. each scheduler and web instance has a connection pool, so could end up with something like 10 each. in practice, probably no more than 1 each most of the time. somewhere between 10-50 sounds like a good value to put on the form.
[19:07:25] queries per second depends on workload and what users do (how often they look at build history). maybe take your current jobs/second that you run and multiply that by 4 (there are at least 2 writes and one read for every job run, plus two more writes for every change enqueued, plus whatever is needed to satisfy user searches).
[19:07:31] for size: opendev uses an average of 1231 bytes for each build record it has. so maybe: 1300*(builds/day)*days_retention, or at least 100MB, or larger if it's easy
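(Back-of-the-envelope only: plugging the Jenkins numbers mentioned further down — roughly 61000 stored builds expiring after 7 days — into that formula, and assuming a 28-day retention purely for the sake of the example:)

```text
builds/day        ≈ 61000 / 7            ≈ 8700
28-day retention  ≈ 8700 * 28            ≈ 244000 builds
size              ≈ 244000 * 1300 bytes  ≈ 320 MB
```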
[19:22:47] mutante: ^ so basically usage would be epsilon/peanuts/barely noticeable :)
[19:23:11] (as far as I understand DBs, and relative to the en.wikipedia.org database usage!)
[19:24:14] thank you both, alright!
[19:26:17] mutante: contint / Jenkins has 61000 stored builds. The Quibble ones make up half of that (27500)
[19:26:37] they expire after 7 days
[19:27:05] so even if we kept them 4x longer that would be less than 200k builds
[19:27:33] and I guess as many records in the db, with queries certainly having indexes / primary keys etc
[19:28:49] ok! :)
[20:14:42] * hashar admires the beauty of Zuul config errors: https://zuul-dev.wmcloud.org/t/wikimedia/buildset/989690a4922b4dbf8ac6c2acea9371c9
[20:15:14] once fixed (I had to move the playbooks from ./zuul.d directly to ./)
[20:15:16] https://zuul-dev.wmcloud.org/t/wikimedia/build/3af7b79bd4884d40bcfdcb9d491adc3f
[20:30:56] hashar: yay! we try very hard to make them useful. if possible, it will try to leave a comment in gerrit at the specific line. i'm guessing the config isn't completely set up yet which is why it isn't showing up there at the moment.
[20:37:44] yeah most certainly
[20:38:17] late evening hacking is cool on a known system, but right now I have a few stacks to digest (kubernetes/ansible/zuul/nodepool) :)
[20:38:34] https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/1152814
[20:38:52] I have tried to use a batch/v1 Job
[20:40:27] and the launcher trace https://phabricator.wikimedia.org/P76868
[20:40:33] something about k8s namespace creation :)
[20:52:34] i'm guessing a problem with the launcher and the k8s credentials... there might be something earlier in the log (closer to startup or the last time it reloaded the config?). or we might need to take a look at the nodepool config file.
[20:53:01] might be time for someone to walk me through logging into the host where nodepool-launcher is running so i can inspect those?
[20:53:15] ah yeah sorry cause I went for dinner earlier :b
[20:53:24] then came back only to jump into brute force hacking mode
[20:53:52] heh no need to apologize :)
[20:55:16] I pasted some explanations earlier at 16:32 UTC
[20:55:28] then the canonical doc is https://wikitech.wikimedia.org/wiki/Help:Accessing_Cloud_VPS_instances
[20:55:46] and the account management is done via https://idm.wikimedia.org/
[20:55:59] where you'd need to upload a freshly generated keypair
[21:10:53] i uploaded a key there, and hit the activate button, but it says it's not active. is there a delay?
[21:11:13] hmm
[21:11:18] ssh -J corvus@bastion.wmcloud.org corvus@zuul-1001.zuul3.eqiad1.wikimedia.cloud
[21:11:21] corvus@bastion.wmcloud.org: Permission denied (publickey).
[21:12:10] https://ldap.toolforge.org/user/corvus
[21:12:14] I don't see the ssh key there
[21:12:23] compared to https://ldap.toolforge.org/user/hashar
[21:12:27] so yeah I imagine there is some delay
[21:13:50] but why
[21:13:57] cause as soon as you POST, that should write to LDAP
[21:14:59] corvus: and your keys are showing up at https://idm.wikimedia.org/keymanagement/# ?
[21:16:33] yes; i'll send you a screenshot out of band
[21:16:59] our stack has some fun oddities sometimes
[21:18:43] no judgement from me. when i worked at the fsf, this would have been a couple of days round trip with RT tickets... :)
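(For context, the batch/v1 Job experiment mentioned at 20:38 could look roughly like the following in an Ansible playbook. This is a sketch, not the actual change under review: the resource names, the namespace variable, the image, and the wait settings are all assumptions:)

```yaml
# Sketch only: run a one-shot batch/v1 Job from a playbook on a
# namespace-type node. Names, namespace variable and image are placeholders.
- hosts: localhost
  tasks:
    - name: Run a test Job in the namespace provided by nodepool
      kubernetes.core.k8s:
        state: present
        namespace: "{{ k8s_namespace | default('default') }}"  # hypothetical variable
        definition:
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: gerrit-ping-test                              # placeholder name
          spec:
            backoffLimit: 0
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: test
                    image: docker-registry.wikimedia.org/releng/example:latest  # placeholder image
        # wait/wait_condition/wait_timeout are parameters of the module
        # itself, i.e. siblings of "definition"
        wait: true
        wait_condition:
          type: Complete
          status: "True"
        wait_timeout: 600
```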
[21:21:09] yeah that sounds familiar
[21:21:29] I guess in most companies it ends up being complicated, and at some places you are lucky if you even get a laptop on the first day
[21:21:41] so something is broken in our identity management system
[21:22:05] or you need to be activated manually somehow
[21:22:20] I'll ask in #wikimedia-cloud-admin which has the team managing our openstack cluster among other things
[21:22:48] to try to keep the other thing moving in parallel: maybe you can look at the nodepool logs for k8s authentication issues. i think in this setup, it should be trying to load from ~/.kube/config. maybe it's something as simple as needing to bind-mount that file into the container.
[21:28:24] oh man
[21:29:26] I have put it under /zuul/etc_nodepool/kube/config
[21:30:57] so it is visible in the launcher as /etc/nodepool/kube/config
[21:32:48] I am filing a task for cloud admins to investigate
[21:33:03] (that last sentence is about your account / ssh key)
[21:39:40] corvus: I have filed it in our bug tracker https://phabricator.wikimedia.org/T395857
[21:40:10] hashar: okay, nothing about what i'm about to say is good; we'll be fixing a lot of this in the nodepool-in-zuul work. but....
[21:41:28] nodepool doesn't have an option to specify the k8s config file location; and i don't think the python library has an env variable to set it either, which means it's only going to look in ~/.kube/config
[21:41:57] the nodepool container runs as the root user by default, so i think that would need to actually be /root/.kube/config inside the container
[21:42:16] ahhh
[21:42:17] there is a nodepool user built into the container, so if you choose to run as that user (which is not the default, but it *is* how opendev runs)
[21:42:27] I was following the Nodepool doc that says: "Before using the driver, Nodepool either needs a ``kube/config``"
[21:42:30] then that would be /var/lib/nodepool/.kube/config
[21:45:36] yeah that does seem to be misleading
[21:45:40] oh good news
[21:45:54] i think the KUBECONFIG env variable can be set to point to it
[21:46:01] https://github.com/kubernetes-client/python/blob/master/kubernetes/base/config/kube_config.py#L48
[21:46:10] magiiiiic
[21:46:16] so you should be able to put it wherever and then set KUBECONFIG; that may be easier
[21:46:43] on it!
[21:48:37] after that I will have to find out how to rebuild the launcher service :b
[21:51:20] any particular reason?
[21:51:50] so that the entry point changes nodepool-launcher from using -f to -d
[21:57:13] you are a magician: urllib3.exceptions.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: Hostname mismatch, certificate is not valid for 'microk8s.zuul3.eqiad1.wikimedia.cloud'.
[21:57:15] that is great
[22:11:15] I have regenerated it and
[22:11:16] https://zuul-dev.wmcloud.org/t/wikimedia/build/43537ba42801483496874313b2c07a1d
[22:11:20] tada! An error!
[22:27:17] corvus: regarding your ssh key, you'd want to generate one with a different type to work around some issue in our tool https://phabricator.wikimedia.org/T395857#10877689
[22:27:32] why it does not recognize ssh-ed25519? I have no clue
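(A minimal sketch of what the KUBECONFIG fix discussed at 21:45 could look like in the launcher's docker-compose service. The service name, image reference, and read-only flag are assumptions; only the paths come from the conversation above:)

```yaml
# Assumed shape of the nodepool-launcher service; only the environment and
# volumes entries matter here. Service and image names may differ locally.
services:
  nodepool-launcher:
    image: quay.io/zuul-ci/nodepool-launcher:latest
    environment:
      # Point the kubernetes python client at the mounted file rather than
      # relying on ~/.kube/config (/root/.kube/config for the default root
      # user, /var/lib/nodepool/.kube/config when running as "nodepool").
      KUBECONFIG: /etc/nodepool/kube/config
    volumes:
      # Host path mentioned at 21:29, visible in the container as /etc/nodepool
      - /zuul/etc_nodepool:/etc/nodepool:ro
```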
[22:43:42] well
[22:43:43] https://zuul-dev.wmcloud.org/t/wikimedia/build/db1791a68e324af1b6dd5c2365eadf88/console
[22:43:57] https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/1152814
[22:44:01] that is the source change
[22:44:20] it creates a playbook using kubernetes.core.k8s and a Job.batch
[22:44:29] configured using https://kubernetes.io/docs/concepts/workloads/controllers/job/
[23:18:28] yep, that's a pretty good framework for a namespace job; but i think the wait and wait_condition need to be at the same level as "definition"
[23:19:15] i think we'll want to look into that as a "pod" style job instead of namespace. one of the big benefits of that is live streaming logs.
[23:19:51] so for a pod, you would just specify the "docker-registry.wikimedia.org/releng/commit-message-validator:2.1.0" image for the pod in nodepool.yaml
[23:20:13] then there are some roles in zuul-jobs for getting the git repos into the pod
[23:21:06] https://zuul-ci.org/docs/zuul-jobs/latest/general-roles.html#role-prepare-workspace-openshift
[23:21:48] don't worry about "openshift" -- it is named that way because it uses the openshift client, which has a "synchronize" command. but it works with plain k8s.
[23:22:21] that's built into the zuul-executor image, so you don't need to install anything extra
[23:27:10] corvus: openshift I guess is similar to podman/buildah: those are the RedHat forks?
[23:27:31] I started with a namespace, but yeah that can be revisited with a pod
[23:27:40] the commit is https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/1152814
[23:28:22] feel free to hijack or copy paste it to another change :]
[23:28:45] I am going to bed. it is 1:30am here and well.. that evening hack turned out to be a looong one :b
[23:35:35] hashar: yep; the openshift client ("oc") is fully compatible with kubectl. goodnight! :)
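(To round off the pod-style suggestion above, a sketch of what the job side could look like. The job, node, and label names are placeholders; the label is assumed to be a pod-type entry in nodepool.yaml pointing at the commit-message-validator image mentioned at 23:19:)

```yaml
# zuul.d/jobs.yaml (sketch) -- names are placeholders
- job:
    name: commit-message-validator-k8s
    nodeset:
      nodes:
        - name: container
          label: commit-message-validator   # assumed pod-type label in nodepool.yaml
    pre-run: playbooks/pre.yaml
    run: playbooks/run.yaml
---
# playbooks/pre.yaml (sketch) -- copies the prepared git repos into the pod
# using the zuul-jobs role linked at 23:21; it works with plain k8s despite
# the "openshift" in its name
- hosts: all
  roles:
    - prepare-workspace-openshift
```

With this shape the pod is just another node to the executor: the pre-run playbook syncs the repos in, and the run playbook executes inside the pod with live-streamed logs.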