[13:59:17] <thcipriani>	 The migration to OpenLDAP happened in 2015, so this should only affect accounts which have been created a decade ago and are now adding an SSH key while they had not configured any before.
[13:59:40] <thcipriani>	 ^ meant to put a ">" in front of that :)
[14:00:10] <thcipriani>	 anyway: corvus you should be able to login to the zuul3 vms now :)
[14:06:50] <hashar>	 what was wrong? :)
[14:07:12] <hashar>	 and indeed I see the ssh key at https://ldap.toolforge.org/user/corvus  \o/
[14:10:17] <thcipriani>	 very old account, I guess, was what was wrong
[14:25:31] <hashar>	 all lower case <> camel case
[14:25:37] <hashar>	 fun :-]
[16:03:53] <hashar>	 I need to check a few things IRL (bakery/kids) and will be back
[16:03:58] <hashar>	 in roughly half an hour
[16:33:05] <hashar>	 back
[16:33:48] <hashar>	 to continue about where to run the Executor and what they can run it seems the issue is running untrusted code inside the production network (even if there is isolation with bubble wrap)
[16:33:58] <hashar>	 that is what I have carried from the conversations we had earlier
[16:34:50] <hashar>	 so tentatively we might check whether there is a way to disable running arbitrary ansible playbooks, but at a quick glance that does not seem to align with what Zuul does
[16:35:07] <hashar>	 which is by default everything is untrusted and thus contained via bubble wrap
[16:43:31] <hashar>	 then the Ansible playbook would run with Bubble wrap
[16:43:32] <hashar>	 on the executor which is containerized by Docker (albeit privileged to allow Bubble wrap)
[16:43:32] <hashar>	 the Docker daemon being on a Ganeti VM (Qemu?)
[16:46:43] <hashar>	 So I guess one of:
[16:46:43] <hashar>	 1) turn off untrusted playbooks (which dramatically reduce the usefulness of the system)
[16:46:43] <hashar>	 2) enhance the containerization (maybe Bubble wrap and/or Docker can use some tuning), can we isolate the network?
[16:46:43] <hashar>	 3) move the Executor and their arbitrary code execution to another executing environment (k8s on top of WMCS or similar)
[16:56:23] <corvus>	 good summary; i don't think there's anything we can do quickly with zuul for #2 (if there is something useful to be done, that's likely a long process for an enhancement, not something i'd recommend waiting on for deployment).  but if there's something you can do at the host/network level (iptables? auth proxy? etc) that might have a faster turnaround.
[17:12:21] <dduvall>	 hashar: i think #3 introduces more problems than it solves. for instance, i don't know how we could safely provide registry credentials to jobs that are being handled on an WMCS host 
[17:13:58] <dduvall>	 corvus: re: #1 would making use of https://zuul-ci.org/docs/zuul/latest/tenants.html#attr-tenant.untrusted-projects.%3Cproject%3E.exclude be sufficient to disable loading of configuration from a given project?
[17:21:14] <corvus>	 dduvall: yes, but actually "include: []" is the simpler form.  that will prevent projects from defining their own jobs, so they can only be defined in trusted repositories.  so you can centrally define your jobs, jjb-style.
[17:21:52] <dduvall>	 got it
[17:23:35] <corvus>	 we should be able to come up with some pipeline definitions that would allow speculative execution of job definitions only after they have been reviewed.  so we could use "include: []" on most projects, but then we could have the project where the jobs are centrally defined have an extra pipeline that allows the running of reviewed-but-not-merged jobs, so that it's easier to make changes to these 
[17:23:41] <corvus>	 central job definitions.
[17:24:26] <dduvall>	 that sounds better than what we have currently at least
[17:24:31] <corvus>	 yeah, that may be a sort of middle ground where we can reduce the work needed to maintain the jobs by allowing certain users to speculatively run changes to jobs
[17:24:49] <corvus>	 (but still not open it up to "internet facing arbitrary code execution")
[17:24:56] <dduvall>	 right
[17:29:08] <dduvall>	 thcipriani: i think we left the user-facing config (and whether we want folks writing ansible for their own job definitions) out of scope for this project, so while we're leaving some of the flexibility of zuulv>3 on the table, we would still be accomplishing our primary goals of 1) getting off of the deprecated zuulv2/python2 system; and 2) maintaining the dependent pipeline parts of zuul that we found to be indispensable
[17:29:35] <dduvall>	 that's important to keep in mind i think. good news, bad news :)
[17:30:46] <dduvall>	 or Fortunately, Unfortunately for folks that appreciate that children's book
[17:33:02] * thcipriani catches up
[17:33:56] <dduvall>	 i vote option #1 since it unblocks us while maintaining our original focus, and we can always explore option #2 or other hardening measures down the road if we decide we want to try and achieve user-provided ansible in a way that satisfies security concerns
[17:41:19] <thcipriani>	 I agree that option #3 seems like a fraught path. Option #1 does achieve the primary goals, albeit by limiting zuul. But the limitation is one we already have, so it seems like the best compromise to make, to me.
[17:44:39] <thcipriani>	 for option #3, when I say "fraught" I mean: (a) we can't have the only executor in wmcs (as dduvall points out, we need to be able to push to the registry, so no way to share those credentials safely) (b) if we have executors in wmcs and prod then we need to open up zookeeper to wmcs, which seems like something we don't want to do (afaiu), or we have two parallel zuul systems: one for
[17:44:41] <thcipriani>	 prod/one for wmcs and that seems like a weird idea.
[17:46:07] <thcipriani>	 option #1 seems like pretty much what we've got now and that's what we need in the near term
[17:51:05] <dduvall>	 even if exposing a prod hosted zookeeper to a wmcs hosted executor was feasible (which i don't think it is), i don't understand how we would have two executors with distinct restrictions listening on the same zookeeper queue. maybe corvus can clarify but that doesn't seem to be the model in general. isn't the purpose of having multiple executors for scaling and availability, not for distinct security domains?
[17:51:32] <corvus>	 2 things:
[17:52:56] <corvus>	 first: sharing the registry credentials safely may be something that can be accomplished.  opendev has several examples of that.
[17:54:06] <corvus>	 second: the executor "zone" feature in zuul is designed for locating executors in specific network domains (mostly so they have access to isolated worker nodes).  so that is possible, just with the caveat that they need to be able to reach zookeeper and gerrit/gitlab.
[17:55:04] <corvus>	 https://zuul-ci.org/docs/zuul/latest/configuration.html#attr-executor.zone is how that works
[17:55:12] <dduvall>	 ok, cool. thanks for clarifying
[17:56:07] <dduvall>	 our registry restricts pushes to only our production network as well, so credentials are only half of it. the zone thing looks interesting though
[17:56:48] <thcipriani>	 yeah, this is interesting, but dan's "even if exposing a prod hosted zookeeper to a wmcs hosted executor was feasible (which i don't think it is)" is probably the biggest obstacle to that idea.
[17:56:57] <corvus>	 (just let me know if we need to get into the weeds of credential access; i feel like the network stuff is the simplest determining factor right now though)
[17:57:00] <dduvall>	 how does one select which zone their job runs in and is that configurable in an untrusted project?
[18:00:31] <corvus>	 dduvall: via nodepool, and it is configurable in an untrusted project, so we may have difficulty using that as part of an access control mechanism.
[18:01:03] <dduvall>	 ok
[18:01:06] <corvus>	 (zuul job requests a certain nodepool label, the nodepool label has a zone attached, the resulting node tells zuul it needs to use a certain executor to run the job)
[18:04:14] <corvus>	 related but different subject: we were discussing other ways of allowing access to certain nodes by only specific projects: in the case of a node accessed by ssh (either a cloud vm or static node of some kind), in addition to the "global" ssh key that zuul uses for the initial log in to worker nodes, zuul generates a private per-project ssh key.  so it is possible to restrict access to specific 
[18:04:20] <corvus>	 worker nodes to jobs when they're only running for specific projects.
[18:14:44] <dduvall>	 nice, ok. i think the node/project access control becomes more important when untrusted config is allowed, but this is all important to keep in mind for down the road
[18:17:54] <dduvall>	 so a job requests a label and a label has an attached zone... i see that tenant config has `tenant.allowed-labels` but that's tenant wide. theoretically if there were an `allowed-labels` at the project level, perhaps that could function as a proxy for zone control?
[18:18:35] <dduvall>	 (not proposing a new feature now but just trying to think through future possibilities)
[18:19:42] <dduvall>	 in any case, i really think option #1 (central job repo, disabled untrusted project config) is the way to go right now
[18:20:26] <corvus>	 correct, but, tbh, i'm not sure that fits with the zuul configuration model; certainly not with the current nodepool system.  if we did make a change like that, it would have to be with the in-progress nodepool-in-zuul work.  i definitely agree there is a missing feature of some kind, and something like that is a potential resolution.  i'd need to think a bit more about whether that's the best way 
[18:20:32] <corvus>	 to close the gap or something else.  but yes, regardless, no short-term fix there.
[18:21:17] <dduvall>	 cool
[18:21:42] <dduvall>	 that makes sense to me and i appreciate the clarity
[18:22:30] <corvus>	 (to elaborate a bit as an aside: we try really hard to keep as much configuration in-repo as possible, so having the definition of labels/nodes/etc be in-repo (as it will be with nodepool-in-zuul) but then having a restriction on what project can use them be out-of-repo (in the tenant config file, like where the tenant.allowed-labels setting is) is awkward.  so we'd at least want to try to find a 
[18:22:36] <corvus>	 way to put that in-repo and still have it be effectively access-controlled.
[18:23:34] <corvus>	 (nodepool-in-zuul will actually probably drop tenant.allowed-labels because it will be implicit due to whether or not the label appears in the in-repo config for that tenant)
[18:25:13] <hashar>	 I am back around :)
[18:25:16] <corvus>	 (there's a pretty reasonable chance we could add "allowed-labels" (or "disallowed-labels") to a project definition, and make it non-overrideable; so if that is set in a config project, it can't be changed in an untrusted project; that's probably how we'd go about it)
[18:26:49] <corvus>	 (that's actually not too bad from a feature implementation standpoint; if we burn down to that being the last issue -- i'd say there's a reasonable chance that could be done in a reasonable timeframe :)
[18:27:45] <dduvall>	 corvus: nice!
[18:30:26] <dduvall>	 are there currently other config items that work like that? (if defined in a config project cannot be overridden in an untrusted proejct)
[18:31:48] <corvus>	 yes, there are a couple of attributes on the project definition that behave like that already; example: https://zuul-ci.org/docs/zuul/latest/config/project.html#attr-project.default-branch
[18:32:15] <corvus>	 so that approach would slot in fairly well
[18:32:30] <dduvall>	 neat
[18:32:33] * dduvall reads more
[18:33:09] <corvus>	 "Each project may only have one default-branch therefore Zuul will use the first value that it encounters for a given project (regardless of in which branch the definition appears). It may not appear in a Project Template definition."  being the magic words there
[18:33:23] <corvus>	 merge-mode, and queue are similar
[18:34:48] <corvus>	 example usage: https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L29
[18:35:15] <corvus>	 that occurs in a config project which is first in the list of projects for zuul to read config from, and is controlled by the ci system administrators
[18:35:49] <corvus>	 so that forces every starlingx/* project to be in the "starlingx" queue, and that can't be overridden by in-repo config from the starlingx folks
[18:36:56] <dduvall>	 awesome, thanks
[18:37:10] <corvus>	 (the tenant config that lists that repo first: https://opendev.org/openstack/project-config/src/branch/master/zuul/main.yaml#L120 )
[19:46:23] <hashar>	 to achieve "#1 turn off untrusted playbooks"
[19:46:23] <hashar>	 we would go with `tenant.untrusted-projects.*.include: []`
[19:46:23] <hashar>	 which, if I read it properly, means Zuul would not load any items from the projects?
[19:46:24] <hashar>	 ref: https://zuul-ci.org/docs/zuul/latest/tenants.html#attr-tenant.untrusted-projects.%3Cproject%3E.include
[19:46:59] <hashar>	 corvus: is that correct?
[19:47:30] <hashar>	 (I like dduvall summary of the goals: 1) getting off of the deprecated zuulv2/python2 system; and 2) maintaining the dependent pipeline parts of zuul that we found to be indispensable
[19:55:25] <corvus>	 hashar: yes, but take a look at the second example at https://zuul-ci.org/docs/zuul/latest/tenants.html#tenant
[19:55:34] <corvus>	 switch that "exclude" to "include: []"
[19:55:40] <corvus>	 and then it applies to all the projects below
[19:56:14] <corvus>	 (the tenant config yaml format is a little flexible in that you can create these anonymous project groups to apply include/exclude to the whole group)
[19:56:51] <hashar>	 ahh yeah indeed
[19:56:53] <hashar>	 damn yaml
[19:57:06] <corvus>	 we try to use it to good effect :)
[20:00:25] <hashar>	 iirc bd808 wrote a PHP implementation of YAML
[20:01:52] <hashar>	 corvus: regarding pod and having the containers in "pause" state, I could not find any indication of that in nodepool code or in examples. But maybe that is the wrong semantic
[20:02:01] <bd808>	 https://pecl.php.net/package/yaml
[20:04:09] <corvus>	 hashar: ah, i misremembered, it's sleep, not pause: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/kubernetes/provider.py#L357
[20:04:35] <corvus>	 so that should happen automatically; but if it it doesn't work, we can take more control and specify the command ourselves.
[20:04:49] <corvus>	 (also, we should probably change that to "sleep infinity" :)
[20:05:05] <hashar>	 but is it posix!?
[20:05:27] <hashar>	 so nodepool does it for us by overriding the default entrypoint, nice
[20:05:53] <corvus>	 (re posix, maybe not, in which case we should leave it; would need to check)
[20:05:59] <corvus>	 re entrypoint: exactly
[20:08:23] <hashar>	 +1
[20:08:32] <hashar>	 (the posix part was a tongue-in-cheek)
[20:51:31] <hashar>	 hmm I went to try to create the container in the namespace then to prepare-workspace-openshift it : https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/1152814/12/playbooks/commit-message-validator.yaml
[20:51:46] <hashar>	 that does nott prepare anything https://zuul-dev.wmcloud.org/t/wikimedia/build/76c8f24ebb574ce89df3ee83596ba11d/console
[20:51:58] <hashar>	 cause I guess the container I create is not known to the executor :)
[20:53:03] <hashar>	 I have made some additions to the base pre.yaml  ( https://gerrit.wikimedia.org/r/c/integration/config/+/1153353/1/playbooks/base/pre.yaml ) and those do not show up despite a full reconfiguraiton of the scheduler
[20:53:23] <hashar>	 anyway, I understand now why we need labels
[20:53:48] <hashar>	 so all the images we have should be configured as labels in Nodepool
[20:54:03] <hashar>	 and then jobs simply refers to them instead of trying to spin them up
[21:13:46] <corvus>	 hashar: yeah, when you add a pod-based label to nodepool, the pre-run playbook will run against that pod
[21:14:13] <corvus>	 then you'll want to change the run playbook to an ansible command to run the commit check
[21:14:27] <corvus>	 (instead of "create container")
[21:15:33] <hashar>	 so my issue is  the pre playbook has `hosts: all`
[21:15:45] <hashar>	 but since the container does not exist, it is obviously not running anything :b
[21:17:12] <hashar>	 what I though is in the pre plyabook to create the container and have the returned name added to the inventory
[21:17:19] <hashar>	 anyway I'll go with labels
[21:19:21] <corvus>	 yeah -- just to be clear, what's the goal?  continue exploring the k8s "namespace" approach, or switch to "pod"?  i was assuming switching to pod
[21:20:01] <corvus>	 i'm trying to nudge in that direction because i'm 99% sure that's going to achieve your goal pretty quickly/easily
[21:21:00] <corvus>	 we can totally make the namespace approach work, but it's a lot more work because what you said is exactly right: you would have to create the pod yourself, add it to the ansible inventory, and only then could you run the workspace sync against it.  but all that's handled much easier with the nodepool pod-based label approach.
[21:21:20] <hashar>	 I am switching to pod AND exploring :]
[21:21:21] <hashar>	 sorry
[21:21:41] <corvus>	 non-linear thinking.  :)
[21:21:53] <corvus>	 i know it well
[21:21:59] <hashar>	 yeah that has been the pity of my existence since early child hood
[21:22:11] <hashar>	 it is not even tree like thinking
[21:22:16] <hashar>	 it is the Amazon!
[21:22:54] <hashar>	 maybe I should pair with you instead :)
[21:23:39] <hashar>	 meanwhile, did you get ssh access to the instances that powers https://zuul-dev.wmcloud.org/ ?
[21:24:08] <corvus>	 i haven't switched back to that yet, i am going to try to do that today
[21:25:13] <hashar>	 at least your key is present
[21:26:11] <corvus>	 oh since i hadn't caught up on that, i didn't realize the fix was "someone fixed it" not "i need to make a different key".
[21:26:20] <corvus>	 i just ran the ssh command and it works now
[21:26:29] <corvus>	 corvus@zuul-1001:~$ 
[21:27:10] <hashar>	 the task was https://phabricator.wikimedia.org/T395857 I have CC you on it
[21:27:41] <hashar>	 and that got closed. Turn out your account was super old and its ldap schema did not get updated 
[21:28:24] <hashar>	 that instance has a clone of zuul/zuul.git@11.3.0
[21:28:33] <hashar>	 I have created a symlink at /zuul
[21:29:21] <corvus>	 if i want to see logs, "sudo docker logs" ?
[21:29:36] <hashar>	 sudo docker compose logs -f -n0
[21:29:41] <hashar>	 from /zuul
[21:29:44] <hashar>	 that is the one I have used
[21:30:10] <corvus>	 got it, thx
[21:31:10] <corvus>	 the commit-message-validator label in teh nodepool.yaml lgtm
[21:31:49] <hashar>	 yup
[21:31:55] <hashar>	 I am now doing the run command
[21:54:41] <corvus>	 hashar: the current error about gathering facts has more explanation in the executor log
[21:54:47] <corvus>	 2025-06-03 21:39:44,424 DEBUG zuul.AnsibleJob.output: [e: 9c74c281ceb04e3db45e2d28df968408] [build: d149977d9c1141e387488e910c9d2f68] Ansible output: b'fatal: [linter]: UNREACHABLE! => {"changed": false, "msg": "Failed to create temporary directory. In some cases, you may have been able to authenticate and did not have permissions on the target directory. Consider changing the remote tmp path in 
[21:54:53] <corvus>	 ansible.cfg to a path rooted in \\"/tmp\\", for more error information use 
[21:54:56] <corvus>	 -vvv. Failed command was: ( umask 77 && mkdir -p \\"` echo /root/.ansible/tmp `\\"&& mkdir \\"` echo /root/.ansible/tmp/ansible-tmp-1748986784.1195168-5-99507260522625 `\\" && echo ansible-tmp-1748986784.1195168-5-99507260522625=\\"` echo /root/.ansible/tmp/ansible-tmp-1748986784.1195168-5-99507260522625 `\\" ), exited with result 1", "unreachable": true}'
[21:55:17] <corvus>	 tldr; i think we're expecting a writeable /tmp in the container
[21:55:44] <hashar>	 isn't it also stating something about ssh,
[21:55:45] <hashar>	 ?
[21:56:11] <corvus>	 yeah, it's a misleading error in the web ui; the one from the log is better
[21:57:05] <hashar>	 ah that error you pasted is for the "linter" hosts which is the pod
[21:57:06] <corvus>	 oh here it does show up in the console log: https://zuul-dev.wmcloud.org/t/wikimedia/build/d149977d9c1141e387488e910c9d2f68/log/job-output.txt#15
[21:57:43] <corvus>	 so you don't need to go to the executor log to see it, but it does not show up in the "console" tab, which is unfortunate.  i'll have to see what we could do about that in the future.
[21:58:04] <hashar>	 that is handy
[21:58:07] <corvus>	 anyway, that's the first error; fix that and the rest should fall in line.  the errors about "ssh" are really there because of the first failure.
[21:59:35] <corvus>	 so is there a writeable space in that container?  if so, we may be able to tell ansible to use a different location, or user.  or we could update the pod config in nodepool to add a new tmpfs volume to the pod at /tmp.
[21:59:41] <hashar>	 good question :)
[22:00:09] <hashar>	 also it has `USER nobody`
[22:00:34] <corvus>	 might need to change that then too :)
[22:01:26] <hashar>	 can that be configured from nodepool?
[22:02:21] <hashar>	 https://zuul-ci.org/docs/nodepool/latest/kubernetes.html#attr-providers.[kubernetes].pools.labels.spec !
[22:03:23] <corvus>	 yep that's the escape hatch
[22:05:00] <corvus>	 there isn't a dedicated nodepool setting to change the user, so if we need to do that, we'll switch to speciying the entire spec ourselves.  we'll just need to include the parts that nodepool currently includes (like the "command" override)
[22:08:15] <corvus>	 hashar: https://paste.opendev.org/show/bbDWigjK1kpimqNJ5sGc/
[22:08:29] <corvus>	 that should be equivalent to what's there now, then we can add whatever else we need to that
[22:08:47] <hashar>	 yup I got that from the doc
[22:08:49] <corvus>	 (except i snuck in a switch to "sleep infinity" if you want to try that out and we can see if it works :)
[22:09:22] <hashar>	 I am trying to find whether the user can be changed in Kubernetes ( https://kubespec.dev/v1/Pod )
[22:10:22] <hashar>	 anyway
[22:10:31] <hashar>	 I imagine the container default to `USER root`
[22:10:42] <hashar>	 then the ansible base playbook drops privilege
[22:10:49] <hashar>	 so we would need that command to be in the images
[22:11:04] <corvus>	 you mentioned this image has "USER nobody" so that's what it'll use
[22:11:20] <corvus>	 https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-the-security-context-for-a-pod
[22:11:27] <corvus>	 "runAsUser" is what you want i think
[22:11:43] <corvus>	 (that needs to be a uid, not a string)
[22:12:33] <corvus>	 that's only going to be necessary if there is no place for the nobody user to write to; probably some more inspection of the image is necessary to determine that
[22:13:23] <corvus>	 we may also be able to ignore that, and just make sure that /tmp and wherever we want to put the workspace are writeable by nobody
[22:13:39] <corvus>	 so we may be able to ignore the user id and just add volume mounts
[22:14:01] <hashar>	 or tell ansible to write to /tmp instead of /root/.ansible/tmp
[22:14:18] <hashar>	 the later is definitely not writable by "nobody"
[22:16:01] <corvus>	 oh i missed that it was /root/ -- that does sort of suggest that it's running as root; i'm surprised that's the case with the USER layer in the container image.
[22:16:30] <corvus>	 okay, so, easiest fix is to mount a tmpfs at /root
[22:17:02] <corvus>	 i'd start with that, then we can decide later if we want to change users or the image, etc.
[22:17:23] <corvus>	 so you can ignore the "spec" for now, and just use "volumes" instead
[22:18:04] <hashar>	 where is that TASK [Gathering Facts ]  running?
[22:18:15] <hashar>	 isn't it running on the Executor/localhost?
[22:18:46] <corvus>	 that's an ansible process on the executor trying to start a python process inside the container to learn more about the container
[22:19:27] <corvus>	 ansible works by coping over small programs (usually python scripts) then running them
[22:19:50] <corvus>	 the step where it tried to copy the python script to the remote host (container) is what failed
[22:19:56] <corvus>	 it tries to put the script in that temp dir
[22:20:27] <hashar>	 adn could it be that it fails to run that command cause it attempts to SSH into linter?
[22:20:54] <hashar>	 it refers to https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/refs/heads/zuul3/playbooks/base/pre.yaml
[22:22:08] <hashar>	 that one used to only have `add-build-sshkey`
[22:22:15] <corvus>	 the first error, the one that doesn't mention ssh at all but instead mentions the write failure, is run by the zuul executor before the playbook actually starts
[22:22:28] <corvus>	 at that point, it does know that it's a k8s pod, and it knows how to connect to it
[22:23:32] <corvus>	 because it fails to get any information about the pod, there isn't enough information for the later pre-run playbook to operate properly, and it tries to get that information again and fails, and outputs the less-than-helpful error about ssh
[22:23:46] <hashar>	 ahhhh
[22:23:55] <hashar>	 so the failure cascades to the other steps
[22:24:28] <corvus>	 (the fact that it's the zuul executor itself that runs the fact gathering is the reason why it only shows up in the text log and not the "console" tab in the web ui; because we don't write those errors out to there... because, erm, they don't happen that often and it just didn't occur to us to do that, oops
[22:24:53] <corvus>	 i think this should create a writeble homedir for root: https://paste.opendev.org/show/byErpaZXaksNT4qxrvTj/
[22:25:24] <hashar>	 you are faster than me :)
[22:26:07] <hashar>	 copied
[22:26:12] * hashar retests
[22:28:35] <hashar>	 https://zuul-dev.wmcloud.org/t/wikimedia/build/56eddb073cbe49ad871f9261ccef7d92/console :)
[22:28:45] <hashar>	 well done corvus
[22:29:16] <hashar>	 that add-build-sshkey would be the one from https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/refs/heads/zuul3/playbooks/base/pre.yaml
[22:29:25] <hashar>	 which was made for the static host
[22:33:54] <corvus>	 yep, don't need it for pod-based jobs
[22:34:10] <hashar>	 I am dropping those bits
[22:34:14] <corvus>	 so if we end up with a system with some pods and some vms, we'll need two different base jobs (which is fine and expected)
[22:35:32] * hashar nods
[22:35:46] <corvus>	 hashar: you can drop the entire cleanup playbook too https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/a0b7c1d0bfa29c0b1fe6142f33013d0c5d9fa61f/playbooks/base/cleanup.yaml
[22:36:08] <corvus>	 https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/a0b7c1d0bfa29c0b1fe6142f33013d0c5d9fa61f/zuul.d/jobs.yaml#14
[22:36:14] <hashar>	 yeah
[22:36:20] <hashar>	 we should have started with an orphan branch
[22:38:07] <hashar>	 done
[22:38:16] <hashar>	 I wll check https://zuul-dev.wmcloud.org/t/wikimedia/build/6d3bc60cf0b44dc2b5d683602287f9f7
[22:38:31] <hashar>	 since the container runs as nobody it does not have a home directory :)
[22:38:47] <hashar>	 well its home is set to `/nonexistent` which , well, does not exist
[22:41:06] <corvus>	 okay, i'm a little confused how we ended up using /root at the start and /nonexistent later... i would expect that to be consistent.
[22:42:23] <hashar>	 :]
[22:42:55] <corvus>	 (perhaps there's a fallback mechanism in ansible to try /$USER if $HOME doesn't work)
[22:43:25] <hashar>	 I am happy to ignore that for now
[22:44:05] <corvus>	 at any rate, could add another tmpfs to /nonexistent
[22:44:41] <corvus>	 ("the increasingly inaccurately named directory, 'nonexistent'")
[22:44:48] <hashar>	 then that is for zuul_output_dir
[22:45:00] <hashar>	 so I imagine we can do a mount at something like /output
[22:45:09] <hashar>	 and set zuul_output_dir: /output
[22:45:31] <corvus>	 yeah, that part isn't that important, except that a lot of roles assume that the homedir is writeable, so if we make that so, then things are a lot easier
[22:46:35] <hashar>	 indeed
[22:46:38] <corvus>	 i was going to say that should affect prepare-workspace-openshift, but it did not error, which makes me wonder where it put the repos
[22:46:51] <hashar>	 I think we did that to force a failure whenever npm/composer/etc wrote stuffunder /home
[22:47:25] <hashar>	 when instead we want it written to XDG_CACHE_HOME=/cache so we can save it. Something like that
[22:47:25] <corvus>	 ah, well, if you do want to maintain that feature, we can certainly specify a different path for zuul roles to write to
[22:47:50] <corvus>	 must of them should honor a variable called "zuul_workspace_root".  we just might find a few places that don't and need to be updated.
[22:49:08] * hashar then there are a lot of place using `ansible_user_dir`
[22:49:35] <corvus>	 in theory, that should be used as a default for zuul_workspace_root
[22:50:00] <hashar>	 I mean, there are plenty of other roles that do use `ansible_user_dir` for their default
[22:50:34] <hashar>	 it is reasonable to assume a home directory exist
[22:50:40] <corvus>	 okay, i'm guessing the image sets a WORKDIR and that is probably where the openshift role puts the repos
[22:50:57] <corvus>	 it does not have an option to specify a different location.  that is an oversight, and that should be added.
[22:51:15] <corvus>	 (the pod roles in zuul-jobs are not as well maintained as the vm ones)
[22:51:20] <hashar>	 our images would need to be tuned
[22:51:54] <hashar>	 but that is fine. And I expect some would simply disappear
[22:52:30] <corvus>	 yeah, i think it's reasonable to fix all of this by brining the images in-line with what zuul expects.  i just don't want to be prescriptive, so i think it's also okay to fix this by adapting the zuul jobs to what your images expect.
[22:53:58] <hashar>	 +1
[22:54:30] <hashar>	 us USER nobody  is really dated
[22:54:54] <hashar>	 that is cause the Jenkins job does `docker run -v /src:/src commitmessage-validator`
[22:55:14] <hashar>	 and running as nobody sounded nicer than running as root :)
[22:58:15] <hashar>	 if I replace the /root mount by only a /nonexistent mount
[22:58:20] <hashar>	 ansible ccomplains again https://zuul-dev.wmcloud.org/t/wikimedia/build/14e4066baa674d798d9e20ab496b34a6/log/job-output.txt
[22:58:28] <hashar>	 so I will mount both
[23:11:46] <hashar>	 corvus: SUCCESS ! https://zuul-dev.wmcloud.org/t/wikimedia/build/0a6e97513b4c4c9a9401690e5fe13eff
[23:11:50] <hashar>	 subprocess.CalledProcessError: Command '('git', 'config', '--get-regex', '^remote.*.url$')' returned non-zero exit status 1.
[23:11:54] <hashar>	 that comes from our tool :)
[23:12:27] <hashar>	 which I guess runs from the wrong dir
[23:12:51] <hashar>	 I did a `chdir: src`
[23:13:06] <corvus>	 yeah, we're going to need to figure out where the sync role actually put the repos :)
[23:13:18] <hashar>	 I imagine the prepare-workspace-openshift would copy the repo as `src/test/gerrit-ping`
[23:13:59] <corvus>	 yeah.  do you need to chdir to src/test/gerrit-ping for your tool?
[23:14:05] <hashar>	 maybe :)
[23:14:26] <hashar>	 most of our jobs are for single repos
[23:14:41] <hashar>	 so we just cloned that single repo from the zuul-merger to a src directory
[23:15:01] <hashar>	 then the container have that src directory mounted as /src which is the WORKDIR
[23:15:29] <corvus>	 so then in the playbook you would use `chdir: "{{ zuul.project.src_dir }}"`
[23:15:49] <corvus>	 that's a jinja template that should render as src/test/gerrit-ping
[23:16:17] <corvus>	 (read that as "the source directory for the project which triggered this zuul job")
[23:16:17] <hashar>	 that is the project that triggered the build?
[23:16:27] <hashar>	 similar to $ZUUL_PROJECT in the old days?
[23:16:32] <corvus>	 yep
[23:16:47] <hashar>	 well it is all boring
[23:16:58] <hashar>	 we have been using the same concept for half a century
[23:17:02] <hashar>	 concepts
[23:17:20] <corvus>	 hehe, exepct now there can be more than one project :)
[23:17:29] <hashar>	 and everything is in yaml
[23:17:30] <corvus>	 but for the case you dsecribe, that's the same
[23:18:28] <hashar>	 2025-06-03 23:18:14.312955 | TASK [Validate]
[23:18:28] <hashar>	 2025-06-03 23:18:16.546973 | linter | ok: Runtime: 0:00:00.139867
[23:18:47] <hashar>	 https://zuul-dev.wmcloud.org/t/wikimedia/build/c0bc3db424174be8a76614f51df04a47
[23:18:50] <hashar>	 and it is not even 2am!
[23:19:12] <corvus>	 what will you do with all the extra time!
[23:19:26] <hashar>	 sleep cause I wake up in 6 hours :b
[23:20:01] <hashar>	 I also have a sailing competition  starting on Thursday evening. So I guess tomorrow I should really sleep :b
[23:20:08] <hashar>	 more seriously
[23:20:19] <hashar>	 that reached the goal of running one of our existing CI image "as is"
[23:20:41] <corvus>	 https://zuul-dev.wmcloud.org/t/wikimedia/build/c0bc3db424174be8a76614f51df04a47 looks good to me!
[23:21:22] <hashar>	 there are a few keys results I got such as using pod
[23:21:34] <hashar>	 defining the images as labels
[23:22:05] <corvus>	 i do note that the stdout isn't making it into the streaming log (it is in the structured log: https://zuul-dev.wmcloud.org/t/wikimedia/build/c0bc3db424174be8a76614f51df04a47/console#1/0/0/linter )
[23:22:35] <corvus>	 before we dig into that, we should upgrade the test environment to zuul v12 (or, better, just the :latest images)
[23:23:03] <hashar>	 zuul-1001.zuul3.eqiad1.wikimedia.cloud is all your!
[23:23:08] <corvus>	 there were some changes to that recently and i want to make sure that isn't already fixed, and if it does need a fix, that we build on that
[23:23:30] <corvus>	 okay, then i may go ahead and do that now
[23:23:53] <hashar>	 see /zuul/docker-compose.yaml   I think the images have a pinned version
[23:23:56] <corvus>	 haha emacs not found.  story of my life.
[23:24:01] <hashar>	 ahah
[23:24:11] <corvus>	 it's okay, i know how to use the other thing
[23:24:21] <hashar>	 feel free to install emacs
[23:24:38] <hashar>	 that instance is on WMCS so it is really just for dev purposes
[23:24:58] <hashar>	 it is not provisioned with Puppet or anything
[23:25:43] <corvus>	 i'm going to switch to latest; obvs i don't expect you to do that in production, but i do expect there to be a at least a 12.1.0 that will be closer to :latest by the time you get there.
[23:25:48] <hashar>	 +1
[23:26:14] <hashar>	 and the commit I have used is https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/1152814
[23:26:45] <hashar>	 the new Zuul reacts to "retest" (but does not report back)
[23:27:23] <hashar>	 I am off! I am happy to have managed to run a container :]
[23:27:36] <corvus>	 yay!  goodnight and sleep well!
[23:28:22] <hashar>	 self goal for tomorrow: write down a summary on the task about microk8s https://phabricator.wikimedia.org/T395826   then I guess update the design doc
[23:28:26] <hashar>	 see you and happy upgrade
[23:32:41] <corvus>	 https://zuul-dev.wmcloud.org/components upgraded
[23:32:57] <hashar>	 corvus: and did you get emacs? :)
[23:33:03] <corvus>	 nah :)
[23:34:07] <hashar>	 that will be for another day. I am off for real
[23:34:17] <corvus>	 https://zuul-dev.wmcloud.org/t/wikimedia/build/fb891e09df3f4982a3dbdeca4cc97f7b
[23:34:36] <corvus>	 job still works, but it looks like i do need to look into why the stdout isn't showing there\
[23:43:44] <corvus>	 remote:   https://gerrit.wikimedia.org/r/c/integration/config/+/1153391 Add start-zuul-console to pre-run playbook [NEW]        
[23:43:59] <corvus>	 i think that's the issue with the missing output from the log