[07:09:42] morning [07:20:32] morning! [07:30:37] dcaro: still struggling to get my dev env fully functional https://www.irccloud.com/pastebin/7bg1VavY/ [07:32:30] also, jobs-api is crashing in lima-kilo (but I don't care about that right now) [07:33:23] that's the toolforge cli not reaching the api gateway [07:33:54] morning [07:33:55] xd, I really dislike when they say connection refused, but they don't show the url [07:34:19] DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): https:443 [07:34:30] blancadesal: did you do this? https://www.irccloud.com/pastebin/rUuLuXvm/ [07:36:00] (I can't set up lima-kilo locally at the moment, so can't check exact details) [07:36:35] you can also check if the api-gateway is there and running [07:36:54] https://www.irccloud.com/pastebin/HRVQkObA/ [07:37:11] https://www.irccloud.com/pastebin/tJGz1x8O/ [07:37:29] oh nice, so you can use the ip 127.0.0.1 on the toolforge config file [07:37:39] I did that :( [07:37:52] okok, interesting [07:37:59] what do you see in the logs of the api gateway [07:38:00] ? [07:38:14] (kubectl logs --tail 100 -f -n api-gateway deploy/api-gateway) [07:38:45] sorry, `kubectl logs --tail 100 -f -n api-gateway deploy/api-gateway-nginx` [07:38:49] with the `-nginx` [07:39:13] https://www.irccloud.com/pastebin/Nj5zOi2e/ [07:39:44] does it get a new entry when you run the cli? [07:40:52] nope [07:41:04] okok, then the issue is between the cli and the api [07:41:47] can you cat /home/vagrant/.toolforge-lima-kilo/chroot/data/project/tf-test/.toolforge.yaml ?
[07:42:44] https://www.irccloud.com/pastebin/vh5nhdvs/ [07:42:58] argh [07:43:29] there's a double https there xd [07:43:37] yup xd [07:44:06] maybe that's what the log meant: `Starting new HTTPS connection (1): https:443` hahaha [07:44:46] okkkkayyyy, that was just plain STUPID [07:45:00] https://www.irccloud.com/pastebin/XCTR52F4/ [07:45:16] My kind of issues :) [07:45:40] that might also fix toolforge-jobs (as it should be using the same conf file) [07:46:15] I had so many other issues that I managed to solve so didn't think about double checking this one because it seemed straightforward? [07:47:18] let's see jobs-api now [07:48:10] https://www.irccloud.com/pastebin/WRbruURR/ [07:50:38] hmm, `kubectl logs jobs-api-844484fdd5-54t9m -n jobs-api -c webservice` gives [07:50:38] --- no python application found, check your startup logs for errors --- [07:50:38] [pid: 13|app: -1|req: -1/1236] 10.244.0.1 () {34 vars in 380 bytes} [Fri Sep 22 07:48:55 2023] GET /healthz => generated 21 bytes in 0 msecs (HTTP/1.1 500) 2 headers in 83 bytes (0 switches on core 0) [07:50:54] look a bit further up in the logs [07:53:47] https://www.irccloud.com/pastebin/BmTHx2dP/ [07:54:00] blancadesal: you need image-config [07:54:20] that shouldn't fail with "Temporary failure in name resolution" [07:54:20] it's kind of a configmap with the list of images it can use [07:54:55] yep, that's a bit weird [08:06:41] in the meantime, the build pipeline failed [08:06:44] https://www.irccloud.com/pastebin/G7kefKFs/ [08:07:12] it does not seem to have read access to the harbor at 172.17.0.1:8080/tool-tf-test/tool-tf-test:latest [08:07:58] is harbor running on that ip/port? [08:11:00] argh, need to restart the laptop [08:11:08] xd [08:12:40] afaik harbor is running on localhost, where does 172.17.0.1 come from?
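The root cause found above was a doubled scheme in `.toolforge.yaml` (`https://https://...`), which explains the confusing urllib3 log: with a doubled prefix, the parser sees `https` as the hostname and 443 as the default port. A minimal sketch of a guard a client could apply before connecting (the function name and behavior are my own illustration, not code from the toolforge cli):

```python
def normalize_api_url(url: str, scheme: str = "https") -> str:
    """Collapse an accidentally duplicated scheme prefix.

    'https://https://127.0.0.1' is parsed by urllib3 as host 'https' on
    port 443, matching the debug line seen above:
    'Starting new HTTPS connection (1): https:443'.
    """
    prefix = f"{scheme}://"
    rest = url
    # strip every leading copy of the scheme, then re-add exactly one
    while rest.startswith(prefix):
        rest = rest[len(prefix):]
    return prefix + rest
```

For example, `normalize_api_url("https://https://127.0.0.1")` returns `"https://127.0.0.1"`, while already-clean URLs pass through unchanged.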
[08:17:11] when deploying the tekton side of things it tries to detect the ip harbor is in [08:18:01] https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/blob/main/deploy.sh#L31 [08:18:02] https://www.irccloud.com/pastebin/HwNFObN6/ [08:18:07] gets it from the environment variable [08:19:25] the instructions are in the readme https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/blob/main/README.md#setup-harbor [08:19:46] it's for minikube though [08:20:04] I was not involved in the lima-kilo side of things, let me look [08:20:48] it seems to be defined inside lima-kilo: `lima_kilo_harbor_addr_for_within_k8s: "{{ lima_kilo_docker_addr }}:{{ lima_kilo_harbor_config_port }}"` [08:21:18] it's hardcoded there yep [08:21:21] https://www.irccloud.com/pastebin/bwr8mfrD/ [08:22:05] hey! [08:22:24] I guess that does not apply for the vagrant setup? [08:22:27] I wrote that :-( it is the most "common sense" default I could find [08:23:13] that's ok, just trying to figure out what's what [08:23:19] could that be a perms error? [08:23:26] instead of IP access error? [08:23:31] like the robot account or something [08:23:35] it could [08:26:05] I can reach harbor on http://172.19.0.7:8080 too [08:26:57] good, so next we can check if the dockerconfig for the local repo is correct [08:27:09] as it seems the builds-api was able to create the repository [08:28:23] can you do `kubectl -n image-build get secret dockerconfig -o json | jq -r '.data[".dockerconfigjson"]' | base64 -d`? [08:28:32] are there different perms required to create the repo and to "ensure registry read access" ? [08:29:03] one uses the api directly, the other uses docker-like access [08:29:17] vagrant@bullseye:~/lima-kilo$ kubectl -n image-build get secret dockerconfig -o json | jq -r '.data[".dockerconfigjson"]' | base64 -d [08:29:17] Error from server (NotFound): secrets "dockerconfig" not found [08:29:46] interesting, what secrets does it have?
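As discussed above, `deploy.sh` picks the harbor address up from an environment variable, while lima-kilo hardcodes a default built from the docker bridge address (typically `172.17.0.1`, but `172.19.0.1` on the vagrant setup here). A rough Python illustration of that lookup-with-fallback pattern; the helper name and default value are my assumptions for illustration, the real script is bash:

```python
import os

def harbor_addr(default: str = "172.17.0.1:8080") -> str:
    # The deploy script reads HARBOR_IP from the environment; the fallback
    # mirrors the lima-kilo default (docker0 bridge ip + harbor port),
    # which is exactly what differed between the two dev environments above.
    return os.environ.get("HARBOR_IP", default)
```

The later fix to `deploy.sh` made a missing `HARBOR_IP` a hard error (`unbound variable`) instead of silently using a possibly wrong default.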
(`kubectl -n image-build get secrets`) [08:30:02] it should have applied https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/blob/main/deploy/devel/dockerconfig.yaml.template [08:30:22] unless lima-kilo is not using the devel deploy [08:30:27] mmm [08:30:31] that should be the problem! [08:30:51] https://www.irccloud.com/pastebin/gr0fdsmc/ [08:31:03] https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/blob/main/roles/k8s/defaults/main.yaml#L19 [08:31:14] it is using `local` dcaro [08:31:49] the buildservice was one of the first, so it's not using the standard yet (code for that is up for review), it's using `devel` instead as `local` was decided after [08:32:14] ok [08:32:30] that should be an easy patch in lima-kilo then [08:32:42] this should fix that too: https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/merge_requests/9 [08:32:54] but yep, lima-kilo side might be easier to review [08:34:21] maybe blancadesal can do a local change to unblock herself [08:36:05] so change local to devel? and the lima_kilo_docker_addr too? [08:36:16] should be roles/k8s/defaults/main.yaml, instead of `./deploy.sh toolsbeta` you can use `./deploy.sh devel` [08:37:38] it seems that the lima_kilo_docker_addr is ok no? (you were able to curl to it) [08:38:22] so only that deploy command, that will create the dockerconfig to be able to use an http registry (insecure_registries setting) [08:38:33] mine is 172.19.0.1 instead of 172.17.0.1 [08:38:38] oh, then also yes [08:47:11] arturo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/959955/ [08:48:27] taavi: thanks [09:01:29] dcaro: the builds-builder MR +1'd [09:02:47] thanks! [09:04:10] I'll deploy on monday (it needs to run some migration scripts and such) [09:16:37] folks, how are we doing network-wise? any remaining niggles or are we happy things are working ok following the changes earlier in the week? [09:17:23] topranks: I think we are mostly OK for now!
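The `kubectl ... | jq -r '.data[".dockerconfigjson"]' | base64 -d` pipeline used above can be reproduced in a few lines of Python, since Kubernetes Secrets store every `data` field base64-encoded. A self-contained sketch, using a sample dict shaped like the dockerconfig secret discussed in this conversation:

```python
import base64
import json

def decode_secret_field(secret: dict, key: str) -> dict:
    """Decode one base64-encoded field of a Kubernetes Secret manifest.

    Equivalent to:
      kubectl -n image-build get secret dockerconfig -o json \
        | jq -r '.data[".dockerconfigjson"]' | base64 -d
    """
    raw = base64.b64decode(secret["data"][key])
    return json.loads(raw)

# Sample manifest shaped like the secret above (values are from this log).
secret = {
    "data": {
        ".dockerconfigjson": base64.b64encode(
            json.dumps({"insecure-registries": ["http://172.19.0.1:8080"]}).encode()
        ).decode()
    }
}
```

If the secret is missing entirely (the `NotFound` error above), the check has to happen one level earlier, at deploy time.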
[09:17:49] would you like to schedule some kind of recap/what-next meeting next week? [09:17:56] ok cool! [09:18:20] yeah that might not be a bad idea, I know we've a bunch of other hosts to move, cloudcontrols, cloudrabbit, not sure what else [09:18:32] might be good to do a review on exactly where we are at [09:18:49] ok [09:28:58] topranks: yeah, we have cloudcontrols, cloudrabbit and cloudvirt-wdqs left to move, but all of those should be fairly straightforward [09:30:52] Ok [09:31:08] The cloudcontrols were tricky before, but I think we can do them 1 by 1 and be ok hopefully [09:34:03] dcaro: hmm, builds-api has no devel env though? just local/toolsbeta/tools? [09:34:16] https://www.irccloud.com/pastebin/vB9Ayat7/ [09:34:18] On my list from the other day I also have clouddb, cloudmetrics, cloudelastic and clouddumps [09:34:18] yes, 'devel' was replaced by 'local' [09:34:36] The first two of those are on wmf private IP space, last two on public addressing. [09:34:40] so any new component (soon all) do 'local' instead of 'devel' [09:34:58] though just running 'deploy.sh' should be able to figure out the right env too [09:35:08] (for all but buildservice/tekton) [09:35:19] hmm, even that one might [09:35:22] taavi: I'm not quite sure where they fit into the picture, or if perhaps it's just a coincidence they have names starting "cloud" but aren't really WMCS servers [09:35:24] so which one is it that needs to change from 'local' to 'devel' in the lima-kilo setup? [09:35:31] buildservice [09:35:50] builds-builder is the name now [09:36:51] https://www.irccloud.com/pastebin/GWFixv76/ [09:37:00] ^ in roles/k8s/defaults/main.yaml [09:37:10] (that's the already changed stuff) [09:37:17] oh, okay, so it's this one?
https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/blob/main/roles/k8s/defaults/main.yaml#L46 [09:37:32] yep [09:37:50] blancadesal: let me send a patch [09:39:00] https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/78 [09:39:01] ok. about the docker network address, I think I can force it through vagrant so it matches the hard-coded one [09:39:32] 👍 [09:41:20] arturo: testing [09:42:18] blancadesal: it fails! it should complain about a missing var, let me fix that [09:42:32] ./deploy.sh: line 31: HARBOR_IP: unbound variable [09:45:00] refreshed the patch, should work now! [09:49:40] thanks! launching another build run to test [09:53:00] argh, same error [09:53:01] step-analyze: 2023-09-22T09:51:04.595021435Z ERROR: failed to initialize analyzer: validating registry read access: ensure registry read access to 172.19.0.1:8080/tool-tf-test/tool-tf-test:latest [09:53:01] the harbor ip/port is fine this time [09:53:27] (jobs-api fixed itself btw) [09:53:50] blancadesal: I get the same error here on my laptop, so at least that's consistent :-) [09:54:31] yay I guess? xd [09:55:32] can you run `kubectl -n image-build get secret dockerconfig -o json | jq -r '.data[".dockerconfigjson"]' | base64 -d` ? [09:55:50] dcaro: [09:55:51] {"insecure-registries": ["http://172.17.0.1:8080"]} [09:55:56] that is not the same ip [09:56:15] oh, but that's arturo's side, maybe that's ok [09:56:18] it should be the harbor ip [09:56:33] dcaro: i didn't change the network config in vagrant yet, just the hard-coded one for now [09:56:51] yeah my harbor is reachable in there [09:56:58] so there should be a difference between my and arturo's addresses [09:57:16] yes, yours should have `http://172.19.0.1:8080` [09:57:32] yes, that's correct [09:58:09] does docker image address syntax support passing ports like in `172.19.0.1:8080/tool-tf-test/tool-tf-test:latest`?
I don't think I've seen that used before [09:58:54] good point [09:58:57] it's not docker directly, it's the lifecycle that uses that [09:59:17] might be misconfigured though [10:01:12] blancadesal: can you run `kubectl -n image-build get secret basic-user-pass -o yaml` ? that's the user/pass it's trying to use [10:01:26] the annotation `tekton.dev/docker-0: http://192.168.1.101` should point to that same ip/port [10:02:11] https://www.irccloud.com/pastebin/T6pETPTn/ [10:02:45] tekton.dev/docker-0: http://172.17.0.1:8080 [10:02:45] that's a different ip [10:02:58] ah, arturo again xd [10:03:02] hahahah, I get confused [10:03:14] https://www.irccloud.com/pastebin/F08ZEH4m/ [10:03:15] yes, that should be ok in your case (the user/pass look ok too) [10:03:22] that looks ok too [10:04:50] blancadesal: can you try doing 'docker login http://172.19.0.1:8080' using that user/password? (robot$tekton - Dummyr0botpass) [10:05:10] can we tcpdump what it's doing? is it even trying to contact harbor? [10:05:13] you might have to configure your docker to allow that insecure registry [10:06:11] taavi: that's an option yes, we can also check the logs on harbor side [10:06:29] I get this [10:06:44] http: server gave HTTP response to HTTPS client [10:07:34] that's the insecure registries configuration, you will have to add it to docker https://docs.docker.com/registry/insecure/ [10:08:04] essentially adding the dockerconfig we saw above to `/etc/docker/daemon.json` and reloading docker [10:09:05] permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/auth": dial unix /var/run/docker.sock: connect: permission denied [10:09:37] probably needs sudo (or a user in the docker group) [10:09:56] dcaro: now I get `unauthorized: authentication required`
if not you might have to pass it (--user <> --password-stdin, or similar) [10:10:55] ah yes, I was trying from the tool user account. now I get `Error response from daemon: Get "https://172.19.0.1:8080/v2/": http: server gave HTTP response to HTTPS client` [10:11:11] ^ yep, that's the insecure registry config [10:12:04] `echo '{"insecure-registries": ["http://172.19.0.1:8080"]}' > /etc/docker/daemon.json && systemctl restart docker` [10:12:08] or similar [10:12:16] (reload might be enough?) [10:15:28] from harbor logs [10:15:30] failed to authenticate user:robot$tekton, error:Failed to authenticate user, due to error 'Invalid credentials' [10:15:49] I got: `Error response from daemon: Get "http://172.19.0.1:8080/v2/": dial tcp 172.19.0.1:8080: connect: connection refused` [10:16:19] arturo: you can try logging into the web ui with admin/Harbor12345 and check if the robot account exists, and reset the password [10:16:30] blancadesal: that's different, can you try curling that url? [10:16:50] is harbor still up and running? [10:17:14] hmm... tools-prometheus-6 prometheus process got OOMkilled, it has 32G of ram :/ [10:17:21] https://usercontent.irccloud-cdn.com/file/ncBibmqA/image.png [10:17:38] maybe this is the prefix thing [10:18:13] only 3 of the harbor containers are up after the docker restart [10:18:19] that looks ok to me, if the password is ok [10:19:45] dcaro: the password was wrong! [10:20:00] I reset it to the expected value and now I get [10:20:04] https://www.irccloud.com/pastebin/CAPRGk2F/ [10:20:09] nice, that should work then [10:20:57] I guess there is an encoding problem somewhere [10:21:03] when creating the account with lima-kilo [10:24:00] I don't like that there's two ways of setting up harbor [10:25:07] helm and docker-compose? [10:25:08] could it be that it already existed? 
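The `echo '...' > /etc/docker/daemon.json` one-liner suggested above works, but it clobbers any existing daemon configuration. A slightly safer sketch merges the entry into whatever is already there; this is a hypothetical helper for illustration, not lima-kilo code, and docker still needs a restart (or reload) afterwards for the setting to take effect:

```python
import json
from pathlib import Path

def add_insecure_registry(daemon_json: Path, registry: str) -> None:
    """Merge an insecure-registries entry into docker's daemon.json,
    keeping any other settings already present in the file."""
    config = json.loads(daemon_json.read_text()) if daemon_json.exists() else {}
    registries = set(config.get("insecure-registries", []))
    registries.add(registry)
    config["insecure-registries"] = sorted(registries)
    daemon_json.write_text(json.dumps(config, indent=2))
```

Calling it twice with the same registry is a no-op, so it is safe to run on every provisioning pass.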
[10:25:20] https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/blob/main/utils/setup_harbor.py [10:25:22] essentially [10:25:31] and lima-kilo/ansible [10:26:03] as that script has been duplicated inside ansible itself [10:27:01] arturo: can you remove the robot account and rerun lima-kilo? I'm thinking that the account might have already been created or something [10:27:10] and ansible seems to ignore if that's the case [10:28:09] I usually cycle the lima-kilo install [10:28:23] just checked, and it returns "status": 201, on the account creation [10:28:25] does that destroy harbor volumes? [10:28:26] meaning "ok, created" [10:28:34] okok [10:28:47] let me double check the uninstall [10:29:44] dcaro: how can I check if volumes have been cleaned up? [10:30:04] docker-compose mounts data directories, I don't know where lima-kilo puts them [10:30:42] that would be [10:30:42] lima_kilo_harbor_config_data_volume: "{{ lima_kilo_harbor_path }}/data" [10:30:54] lima_kilo_harbor_path: "{{ lima_kilo_local_path }}/harbor" [10:31:00] so, something like [10:31:09] ~/.toolforge-lima-kilo/harbor/data [10:31:37] maybe, probably configured `lima_kilo_harbor_config_template_path: "{{ lima_kilo_harbor_path }}/harbor/harbor.yml.tmpl"` here [10:32:30] https://www.irccloud.com/pastebin/qRf2oEJ1/ [10:32:55] lima_kilo_harbor_config_data_volume: "{{ lima_kilo_harbor_path }}/data" [10:33:01] I think that's it, yes [10:34:55] running the uninstall playbook deletes that dir [10:35:01] so there is nothing in there [10:35:37] do you have the exact request ansible does?
[10:36:07] yes [10:36:09] https://www.irccloud.com/pastebin/eboxAMSW/ [10:38:11] the response seems to be [10:38:15] "name": "robot$tekton", [10:38:15] "secret": "Ky5TWwq9r9CJbTfF2wNQZWcaspuJ06s2" [10:38:28] so I wonder if harbor is ignoring the password and setting its own [10:38:49] that would explain why re-setting it via the webpanel fixes the problem [10:40:37] maybe, but this was working with the harbor version in prod [10:40:46] did we upgrade the version in lima-kilo? [10:41:03] https://www.irccloud.com/pastebin/trIFO4fG/ [10:41:10] lima-kilo uses 2.5.4 [10:41:20] which I believe is the same as prod? [10:41:29] I could be wrong [10:41:44] yep HARBOR_VERSION=${HARBOR_VERSION:-v2.5.4} # we use this for now [10:47:44] the setup script works [10:47:49] tested on the vagrant setup [10:49:21] this is missing in the ansible side probably [10:49:21] https://gitlab.wikimedia.org/repos/cloud/toolforge/builds-builder/-/blob/main/utils/setup_harbor.py#L76 [10:49:52] ha! so indeed the password needs a reset [10:52:06] https://github.com/goharbor/harbor/issues/16884 [10:53:41] (and it is by design) [10:54:24] makes sense [10:58:51] how did you test the setup? [10:59:28] blancadesal: btw. I was having the same issue with jobs-api on the vagrant deployment, removing the pods seemed to work [11:00:06] dcaro: probably with a hundred local hacks that weren't correctly persisted into lima-kilo [11:00:53] sounds familiar yes xd [11:06:24] hmm, I see lima-kilo sets /etc/wmcs-project to toolsbeta...
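Per the goharbor issue linked above, Harbor ignores the secret supplied at robot-account creation and returns a generated one (by design), so the setup script has to follow up with a secret reset, which is the step that was missing on the ansible side. A sketch of building that follow-up request; the `PATCH /api/v2.0/robots/{id}` path and `{"secret": ...}` body are my reading of the Harbor v2 API and should be double-checked against the version in use:

```python
def robot_secret_reset(create_response: dict, desired_secret: str) -> tuple:
    """Given the JSON body Harbor returned on robot-account creation
    (which contains the *generated* secret, not the requested one),
    build the follow-up request that sets the secret we actually want.

    Returns (path, body) for a PATCH against the Harbor API; the caller
    would send it authenticated as the admin user.
    """
    robot_id = create_response["id"]
    return f"/api/v2.0/robots/{robot_id}", {"secret": desired_secret}
```

This mirrors what `setup_harbor.py` does right after the 201 from the creation call, and explains why resetting the password through the web panel also fixed the login.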
that might be an issue [11:06:39] in any case, will pop up when trying to use it [11:08:16] arturo: yes it solved itself after reprovisioning [11:12:52] topranks: I filed T347148 for the future of cloudmetrics, and will talk to o11y since they want to move the current cloudmetrics services to different hosts [11:12:54] T347148: Determine how to monitor services in cloud-private / cloudlb - https://phabricator.wikimedia.org/T347148 [11:14:35] yep, if we stick to having 'toolsbeta' for lima-kilo, we will have to do changes in other places (builds-admission for example will not allow builds from the custom harbor) [11:23:17] Anything going on with volumes in eqiad1? Since last night they've been getting stuck in a detaching state when I try to move or remove them [11:23:52] Rook: I wonder if ceph nodes are missing some network connection to the openstack API [11:24:31] Rook: maybe check cinder logs on cloudcontrol1005 and see if something is reported? [11:24:37] Rook: do you have an example volume I could look at? [11:24:46] PAWS has a few stuck [11:27:36] hmm, one of them shows `os-vol-host-attr:host | cloudcontrol1007@rbd#RBD`. cloudcontrol1007 is currently offline because it's waiting to be physically moved to another rack [11:27:48] although another one says 1005, so that might not be it [11:32:04] blancadesal dcaro this is my proposed fix [11:32:05] https://gitlab.wikimedia.org/repos/cloud/toolforge/lima-kilo/-/merge_requests/79 [11:32:34] * arturo errand [11:38:31] I can test later, looks ok, feels a lot like trying to do something with ansible that would be easier with python/programming language though xd [11:42:51] can I get a review for https://gerrit.wikimedia.org/r/c/operations/puppet/+/960023/? [12:33:48] alert -> HAProxy service nova-api_backend backend cloudcontrol1005.private.eqiad.wikimedia.cloud is down [12:33:56] known ? [12:34:34] resolved itself?
[12:37:53] I was not aware [12:39:24] the alert did not include the host, but perhaps it was cloudlb1002 with the problem you detected yesterday [12:39:36] arturo: testing [12:39:55] blancadesal: I'm afraid the build still fails [12:39:55] step-analyze: 2023-09-22T12:34:56.010907598Z ERROR: failed to initialize analyzer: validating registry read access: ensure registry read access to 172.17.0.1:8080/tool-tf-test/tool-tf-test:latest [12:39:59] :-( [12:42:19] arturo: are you able to do docker login, and curl from within a k8s pod? [12:42:56] let me try [12:52:51] didn't we see this earlier? [12:52:53] https://www.irccloud.com/pastebin/wTs0au8k/ [12:53:23] dockerconfig missing? [12:53:31] did you deploy with 'devel' as environment? [12:54:02] I think so, yes, with [12:54:03] HARBOR_IP={{ lima_kilo_harbor_addr_for_within_k8s }} ./deploy.sh devel [12:54:10] but I can double check [12:54:40] you can try getting it with `kubectl get secret -n image-build dockerconfig` [12:55:10] to try locally, you'll need to edit your docker configuration though, under /etc/docker/daemon.json [12:55:29] manually recreated the robot account and was able to docker login but the build run still fails with the same error msg :( [12:55:38] mmm wait, I wasn't trying in the right namespace [12:56:53] blancadesal: do you see the auth logs on the harbor side [12:57:03] ? [13:00:01] about to start a meeting, will check a bit later [13:00:13] ack [13:06:03] I'm having a hard time attaching to the lifecycle pod [13:09:02] that one has nothing though [13:09:36] yeah, I'm trying to attach with a debug debian container image, which theoretically is something you can do in k8s [13:12:01] I think that starting a container in the same namespace might be enough to test [13:12:21] you might need to mount the dockerconfig [13:12:23] will it mount the secrets etc?
[13:12:53] it won't by itself [13:13:05] I figured :-( [13:13:26] I need to go lunch, will continue later [13:26:41] There's no meaningful way to move all the state from one openstack cluster (VMs, DBs, volumes, etc) and have it appear in a different openstack cluster, is there? [13:29:47] I'm guessing that unless you are restoring a backup (same cluster id/nodes/...) then probably no [13:34:29] Yeah, it felt like a little too much to hope for. Thank you for the answer [13:41:24] btw. I'm playing with codfw ceph stuff, so things might get wonky there [14:30:00] I'm trying to give some advice to Raymond_Ndibe about T344673, I think he has the required permissions to manually reboot all those hosts (as in ssh + reboot), but for the Ceph one he should probably use the cookbooks roll_reboot_mons and roll_reboot_osds. were those cookbooks ever tested on the prod cluster? [14:31:14] and about cloudbackup1003 and cloudbackup1004, is it safe to manually reboot them (ssh + reboot) one at a time? [14:32:56] for ceph, yes those cookbooks were tested and used in prod, but some time ago [14:33:54] do you think they require global root? from a quick look, they don't downtime through icinga but only through alertmanager [14:34:06] just had a look, they look ok (all they do extra is set the cluster in maintenance, and check the health of the cluster after each reboot) [14:34:29] I'm not sure if silencing am also requires the socks proxy (and hence global root to ssh to cumin1001) [14:34:51] they do use icinga [14:34:58] https://www.irccloud.com/pastebin/52s0j84f/ [14:35:01] then I looked too quickly :D [14:35:07] from the reboot cookbook (that they use internally) [14:37:20] I think all the reboot cookbooks use that [14:37:29] (am + icinga) [14:37:31] perhaps we could set a silence manually, and then Raymond_Ndibe could do the reboots with ssh or cumin... 
but we should also find a way around the icinga thing [14:37:58] because that's also blocking running the cookbook from cloudcumins [14:38:13] once we don't have icinga anymore we don't need to downtime there :) [14:38:37] that's probably the way :) [14:39:13] silencing alertmanager uses amtool through ssh to the alertmanager host though, don't you need global root for that too? [14:40:02] yes that's what I was asking (before realizing they also silence through icinga). that's the prod alertmanager host I imagine? [14:40:23] yep, same as icinga iirc (alert1001) [14:40:30] I see spicerack does it through the api now [14:40:32] though [14:40:39] we could try using that too [14:40:44] but not sure how the auth works [14:40:58] I should probably create a ticket about it [14:43:35] could it be that there's no auth at all? [14:43:40] (maybe just fw rules) [14:44:43] that would be nice, if that's the case we just have to add cloudcumins to the allowlist [14:44:52] and we could still use the socks proxy for local runs [14:46:08] oh, it's nginx, but yep, by ip/host [14:46:14] cloudcumin has no rights [14:46:18] but cumin does [14:47:30] the proxy currently tunnels through cumin1001 [14:47:35] so yep, should work there [14:47:50] not sure though if Raymond_Ndibe can ssh to cumin1001 [14:50:07] yes, but sudoers is very limited [14:50:36] while in cloudcumin1001 Raymond_Ndibe has full sudo rights [14:50:48] as long as he can ssh he can proxy :) [14:51:03] hmm true that [14:51:16] we still need to modify the cookbook though to use the API instead of ssh, and also remove icinga :) [14:51:58] yes [14:52:11] the icinga part is kinda needed though, I think we get paged otherwise [14:52:29] (as in, it's needed while we haven't migrated completely yet) [14:52:53] hmm, I wonder if we can just set it to not page [15:04:49] dcaro I can ssh into cumin1001 but can't sudo [15:05:47] that should be enough I think, unless there's some custom proxying directive
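Silencing through the Alertmanager HTTP API (as spicerack now does) instead of amtool-over-ssh only needs network access to the endpoint, which is why adding the cloudcumins to the nginx allowlist would be enough. A sketch of building the `POST /api/v2/silences` payload; the field names follow the Alertmanager v2 silence schema, and the matcher values here are made-up examples:

```python
import datetime

def build_silence(matchers: dict, hours: float, author: str, comment: str) -> dict:
    """Build the JSON body for POST /api/v2/silences on Alertmanager.

    `matchers` maps label names to exact values (isRegex=False); the
    silence runs from now until now + `hours`.
    """
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "matchers": [
            {"name": k, "value": v, "isRegex": False}
            for k, v in sorted(matchers.items())
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + datetime.timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }
```

A reboot cookbook could post one of these per host before the reboot and delete it afterwards, with no ssh (and hence no global root) required.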
* dcaro gtg [15:22:03] cya on monday! [15:23:50] o/ [15:24:22] o/ [15:33:16] arturo: hmm, there seems to be no trace in the harbor logs from the failed build run [15:33:32] strange! [15:35:27] let me run tcpdump to see if I can catch any traffic [15:37:06] I see TCP traffic in a normal fashion [15:37:22] this doesn't seem to be an IP connectivity problem? [15:38:39] what about this previous build log entry [15:38:41] step-copy-stack-toml: 2023-09-22T15:36:32.508518253Z 2023/09/22 15:36:32 warning: unsuccessful cred copy: ".docker" from "/tekton/creds" to "/tekton/home": unable to open destination: open /tekton/home/.docker/config.json: permission denied [15:39:09] do you have the same blancadesal ? [15:39:46] I do, but iirc this warning was there also in successful runs (could be mistaken) [15:40:08] let me check in actual toolforge [15:43:06] true, the warning is also on the live cluster. So, ignoring it [15:45:04] that should not be an issue [15:56:42] I see a very weird entry in the harbor logs [15:56:45] https://www.irccloud.com/pastebin/BL4Jk0Jp/ [15:57:22] suspiciously using the same address as the other POST from the builds-api. So I suspect that is the HTTP request that's failing [15:58:49] \x16\x03\x01: This typically represents the start of a TLS handshake. The 0x16 is the hexadecimal representation of the TLS record type for a handshake, and 0x0301 represents the TLS version (TLS 1.0). [15:58:58] (thanks chat GPT) [15:59:08] xd [15:59:35] xd [15:59:37] https://usercontent.irccloud-cdn.com/file/ralPwvXv/image.png [16:06:57] let the robots work https://usercontent.irccloud-cdn.com/file/isV89zjp/Screenshot%202023-09-22%20at%2018.06.24.png [16:13:39] blancadesal: I will continue next week. I run out of time for today [16:14:52] ok, let's call it a day [16:15:02] * arturo offline [17:55:09] That'd be the dockerconfig with the insecure registries entry