[08:00:49] dcaro: thanks for the weekend maintenance
[08:01:40] 👍
[08:07:34] dcaro: I want to merge and deploy the maintain-kubeusers refactor today. If you don't have any last-minute review, I'll go for it soon
[08:08:46] arturo: do you want me to do another review before merging?
[08:09:39] on my side I'm fine
[08:10:45] blancadesal: I recommend deploying and merging the pre-commit/poetry update MRs one by one, otherwise the toolforge-deploy MRs pile up and block any other change from being merged
[08:10:50] (on that same component)
[08:11:12] I have additional patches for you to review, if you want: https://gerrit.wikimedia.org/r/c/cloud/wmcs-cookbooks/+/1037821, these should be simpler to review
[08:11:43] dcaro: ok, I'll start deploying
[08:50:11] heads up I'm finally applying the Redis timeout change (T363709) after finding out how to do it
[08:50:12] T363709: [toolforge] Redis refusing connections - https://phabricator.wikimedia.org/T363709
[08:55:09] ack
[08:59:04] * dcaro paged
[08:59:06] redis
[08:59:07] xd
[08:59:15] ^^U
[08:59:28] cc dhinus
[08:59:35] looking
[08:59:42] dhinus: I acked, let me know if I can help
[08:59:52] seems it was a flap?
[08:59:55] it might be temporary, I hoped it would be smoother
[08:59:56] (incident resolved, maybe there was a restart and a lost ping from toolschecker)
[09:00:13] there was a failover from the primary host
[09:00:19] ack
[09:00:30] is that a cookbook? (if so would be nice to add a silence)
[09:00:52] master_link_status:down
[09:01:01] so maybe something is still wrong
[09:02:24] toolschecker says ok http://checker.tools.wmflabs.org/redis, so clients have access
[09:03:26] master_link_status:up
[09:03:44] the output of "redis-cli info replication" is now looking fine on all 3 hosts
[09:05:41] dcaro: it's not a cookbook, it might be worth creating one
[09:08:02] this was what I did: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Redis#Puppet_configuration
[09:09:05] dhinus: nice
[09:18:55] can I get a quick +1 here? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1038227
[09:26:50] arturo: hmm will it work though? docker is quite picky about using non-https connections
[09:27:57] no idea!
[09:28:24] the https version also has the same proxy to http://registry
[09:28:50] actually, this module is from prod registry, so I assume it is working :-P
[09:29:06] let's try
[09:29:13] usually the docker client will refuse to push to http endpoints unless they are explicitly listed in daemon.json as insecure registries
[09:29:27] (we have that issue with harbor on lima-kilo)
[09:29:34] dcaro: that's what I was thinking
[09:29:55] yeah, but the client will still use HTTPS
[09:30:06] this is just http between nova proxy and the registry
[09:30:23] client --> HTTPS --> nova proxy --> HTTP --> registry
[09:30:23] I suspect docker server will return a redirect though
[09:30:35] which might mess things up, but we can try
[09:30:49] hmm... how is it working now? using a floating IP?
[09:30:53] proxy_redirect off;
[09:31:00] (as in, docker server is running ssl itself?)
[09:31:09] *docker registry server
[09:31:31] well, the diagram above is actually something like this:
[09:31:49] client --> HTTPS --> nova proxy --> HTTP --> local nginx --> HTTP --> registry
[09:31:55] with the current setup being
[09:32:05] client --> HTTPS --> local nginx --> HTTP --> registry
[09:32:19] hmm in that case it might work
[09:32:32] are there any security considerations there? (as in nova proxy -> local nginx can be sniffed by anyone?)
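Side note on the daemon.json opt-in mentioned a few lines up (09:29): a minimal, hypothetical sketch of what marking a registry as insecure looks like. The hostname is made up, and the change under review keeps HTTPS on the client-facing side, so this workaround should not be needed for it; it is only relevant when the endpoint the docker client talks to is itself plain HTTP (as with harbor on lima-kilo).

```sh
# Hypothetical example: allow the docker daemon to talk plain HTTP to a registry.
# (This overwrites the file; in practice you would merge with existing settings.)
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "insecure-registries": ["registry.example.local:5000"]
}
EOF
sudo systemctl restart docker
```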
[09:33:24] it's the same security level as with any other nova-proxy setup
[09:34:04] in this case, with the content being container images
[09:34:13] and the docker login
[09:34:44] unless we are pushing without login, that would not be very good I think :/
[09:35:20] I see password stuff in the puppet code, so we should have login
[09:36:40] the nova-proxy TLS backend support is well documented, but fixing that is beyond the scope of my change
[09:37:04] the lack of TLS backend support*
[09:37:51] and that's ok, does that mean that we should not use nova-proxy for sensitive information?
[09:39:25] as of today, nova-proxy is the canonical way to expose HTTPS services over the internet. We actively encourage users to use it, with a disclaimer about this limitation
[09:41:02] while I agree the ideal would be HTTPS everywhere, arbitrary network sniffing has repeatedly been shown to be impossible in Cloud VPS openstack, including by external security evaluators
[09:41:33] so I will be in favor of having everything on HTTPS
[09:41:50] but again, I think this discussion is out of the scope for the change I'm proposing
[09:42:03] ^ awesome, then it's ok(ish) for the link between nova-proxy<->docker image registry to be http for now, no?
[09:42:21] Unless you think that it's a blocker for that change, then we should wait until the https backend link support is there
[09:43:26] I will be happy to further discuss with andrew and taavi, both should have way more knowledge about the nova-proxy setup than I do
[09:43:47] xd, /me feels like none of the questions being asked are being answered
[09:44:18] I have the same information as you
[09:45:53] then you don't know either?
[09:47:26] what question do you think needs to be answered?
[09:48:55] "^ awesome, then it's ok(ish) for the link between nova-proxy<->docker image registry to be http for now, no?" and the implicit "Unless you think that it's a blocker for that change, then we should wait until the https backend link support is there"
[09:49:22] that I can make explicit: "Do you think that having http between nova-proxy and the docker registry is a blocker for the change?"
[09:56:09] I think the change is fine. I don't have any concern about the change. I don't think the change has any blocker.
[09:58:27] awesome, +1 then
[09:59:59] I agree, it's probably fine to push it, even if it makes security slightly worse compared to the current setup
[10:01:39] but as you say, securing the traffic from the proxy to the backends is a different problem that we should address separately from this change
[10:01:48] well, I lost interest, so I won't be merging it. It was just a cosmetic change after all
[10:06:39] :( if nothing changes in the setup feel free to use the +1 there if you regain your interest
[10:26:07] Any concerns for me changing the docker image names for CI images from `cloud-cicd-py311bookworm-tox` to `cloud-cicd-py3.11-bookworm-tox` so it matches the name of the directory it's created from?
[10:30:18] 👍
[10:30:31] sgtm
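To make the rename above concrete, a hypothetical sketch of what it amounts to for an already-published image. The full registry path is assumed from the image names quoted later in the channel, and in practice the CI images would presumably just be rebuilt and pushed under the new name rather than retagged by hand.

```sh
# Illustration only, not the actual workflow: map the old image name to the new one.
docker pull docker-registry.tools.wmflabs.org/cloud-cicd-py311bookworm-tox:latest
docker tag docker-registry.tools.wmflabs.org/cloud-cicd-py311bookworm-tox:latest \
    docker-registry.tools.wmflabs.org/cloud-cicd-py3.11-bookworm-tox:latest
docker push docker-registry.tools.wmflabs.org/cloud-cicd-py3.11-bookworm-tox:latest
```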
[10:31:29] where is the directory name coming from?
I thought we used "py311" without the dot in most places, but I'm fine with both
[10:32:22] I guess it was created like that https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/tree/main?ref_type=heads
[10:32:53] I'm ok changing the directory instead, but that might mean changing all the gitlabci.yaml from all the repos that use it (as the yaml files are included from those)
[10:33:04] no need, I was just curious :)
[10:36:34] 👍
[10:37:54] hmm I wonder if a dot in the name can cause issues though, a random search led me to https://github.com/docker/docker-py/issues/294
[10:38:52] Got https://gitlab.wikimedia.org/repos/cloud/cicd/gitlab-ci/-/merge_requests/40
[10:39:15] dhinus: interesting, that seems like an issue yes xd
[10:39:26] I was able to push to the registry without problems, let me try pulling
[10:39:59] no issues ¯\_(ツ)_/¯
[10:40:00] the official docs seem to say dots are fine https://docs.docker.com/reference/cli/docker/image/tag/
[10:40:26] dhinus: can you try `docker pull docker-registry.tools.wmflabs.org/cloud-cicd-py3.9-bullseye-tox:latest` just in case?
[10:40:26] some client libraries might differ, but I'd say let's keep it this way until we encounter any issue
[10:40:57] the command works fine on my machine
[10:41:53] thanks, seems ok to go with it then
[10:44:09] +1
[10:45:08] k8s seems ok pulling that too ` Normal Pulled 4s kubelet Successfully pulled image "docker-registry.tools.wmflabs.org/cloud-cicd-py3.9-bullseye-tox" in 14.228816905s`
[10:47:15] * dhinus lunch
[10:49:20] * dcaro lunch
[12:27:34] heads up: I will be merging and deploying the new maintain-kubeusers version now
[12:28:06] this is the deployment and rollback plan:
[12:28:07] https://phabricator.wikimedia.org/T364312#9854509
[12:28:36] ack
[12:29:46] 🚢
[13:04:24] working on the redis hosts I noticed some prometheus-related errors in the logs. I can see the redis metrics in grafana so they're probably not important errors, but I created T366471
[13:04:25] T366471: [toolforge] [redis] Prometheus exporter logging errors - https://phabricator.wikimedia.org/T366471
[14:01:55] we need this emergency fix for maintain-kubeusers
[14:01:55] https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/merge_requests/27
[14:02:26] cc dcaro
[14:03:54] arturo: the resource quota also needed renaming?
[14:04:18] good question, I didn't check and assumed everything was a mistake
[14:04:21] let me double check
[14:04:55] dcaro: yes, same
[14:04:59] ack
[14:07:15] the functional tests are now not showing that error on my system
[14:07:34] functional tests seem to be passing now, +1
[14:07:46] mine pass too :)
[14:08:13] thanks for the assistance!
[14:08:19] merging and deploying now
[14:13:24] deployed ✅
[14:17:31] dcaro: looks like you encountered https://phabricator.wikimedia.org/T366357
[14:19:08] andrewbogott: yep :)
[14:20:31] I manually deleted the liveness probe of the maintain-kubeusers deployment in tools, so it can finish the first iteration without being killed by the healthchecker
[14:20:40] I'll go grab some food, I'll miss the checkin
[14:20:56] thanks for the deployment!
[14:24:51] andrewbogott: about the secrets on puppetservers, did I mess anything up? (the hook is still disabled, and puppet does not seem to enable it again)
[14:27:02] dcaro: I think I'm missing context, is that in the backscroll?
[14:27:20] andrewbogott: sorry, we can have a quick chat in the sync
[14:27:29] 'k
[15:58:08] FYI maintain-kubeusers finished the loop, I'm reenabling the liveness check
[15:59:12] \o/
[16:05:59] * arturo offline
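For the record, the liveness-probe juggling mentioned above (removed at 14:20, re-enabled at 15:58) can be done with something along these lines. This is a sketch, not the exact commands that were run, and the namespace is assumed to match the deployment name.

```sh
# Assumed namespace/deployment name: remove the liveness probe from the first
# container so the first full reconciliation loop is not killed mid-run.
kubectl -n maintain-kubeusers patch deployment maintain-kubeusers --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'
# Re-applying the original manifest (or redeploying the component) restores the probe.
```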