[07:56:03] hey folks!
[07:56:38] as FYI I've increased vcores and memory for aux-ctrl and ml-ctrl (dse was already done), so we shouldn't get any more pages due to kube-api reloading for TLS cert renewal
[07:56:50] if we do, please check if there is something else going on
[08:09:27] thanks!
[13:52:44] based on the default kokkuri conf I see BUILD_VARIANT: run-py but I think k8s is looking for something else, perhaps "main"
[13:53:26] o/ do you have a specific issue/failure?
[13:53:46] yes, helmfile -e aux-k8s-eqiad -i apply fails
[13:54:18] from /srv/deployment-charts/helmfile.d/aux-k8s-services/zarcillo on deploy1003
[13:54:47] okok, do you have a paste that we can check?
[13:56:57] otherwise I can check manually, anyway "main" should be the helm release in theory
[13:57:19] https://phabricator.wikimedia.org/P75311 ... then it times out in 10 m
[13:57:31] the logging in helm is not all that useful
[13:58:07] ahhhh okok
[13:58:23] so if it times out after 10m it is probably the pod being scheduled but crashlooping
[13:58:39] and after the configured timeout you get the rollback
[13:58:49] indeed it rolls back due to the failure
[13:58:56] 25s Warning Failed pod/zarcillo-main-6c967ccdbb-6wd99 Failed to pull image "docker-registry.discovery.wmnet/wikimedia/zarcillo:2025-03-25-091801-production": rpc error: code = NotFound desc = failed to pull and unpack image "docker-registry.discovery.wmnet/wikimedia/zarcillo:2025-03-25-091801-production": failed to resolve
[13:58:57] reference "docker-registry.discovery.wmnet/wikimedia/zarcillo:2025-03-25-091801-production": docker-registry.discovery.wmnet/wikimedia/zarcillo:2025-03-25-091801-production: not found
[13:59:01] (but not printing logs from the pod)
[13:59:12] can I run kubectl myself?
[13:59:28] federico3: kube_env zarcillo aux-k8s-eqiad ; kubectl get events
[13:59:41] and events are also in logstash ofc
[13:59:46] let me give you a direct link
[13:59:54] thanks
[14:00:18] indeed I don't see zarcillo in https://docker-registry.wikimedia.org/
[14:01:27] federico3: https://logstash.wikimedia.org/goto/5abc10a3e0198608af996e936adde798
[14:01:41] that should lead you straight to kubernetes events for zarcillo
[14:04:03] for future reference how did you find the URL? k8s_event.involvedObject.namespace = zarcillo as a filter?
[14:05:29] the kubernetes event dashboard? It's linked from the home page
[14:05:58] the filters, I applied on the spot by clicking first on the left panel to filter down to just the aux-k8s-eqiad cluster (based on the master)
[14:06:11] and then the namespace on the first zarcillo entry that showed up
[14:13:12] hm, I used the build step from an example in https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/blob/devel/.gitlab-ci.yml?ref_type=heads#L23 but it seems not to upload the container
[14:20:12] federico3: never done it, but maybe you are missing https://www.mediawiki.org/wiki/GitLab/Workflows/Deploying_services_to_production#Publishing_an_image_for_use_in_production
[14:22:52] hw, this is not documented in the python template
[14:23:02] ok thanks
[14:23:41] np!
[14:24:18] this cluster has no staging so this is effectively not real production
[14:25:35] the aux cluster is meant for these experimental/new use cases, we may build a staging cluster in the future
[14:26:49] is BUILD_VARIANT: production required?
[14:30:35] that should be related to what you have defined in the blubber config IIUC
[14:30:52] so https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/blob/devel/.pipeline/blubber.yaml?ref_type=heads
[14:31:11] probably run-py in your case
[14:39:51] oh wow I also need to manually add every project on k8s to https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/blob/main/projects.json?ref_type=heads ?
[14:44:43] IIUC yes, the ones that will push to the registry
[14:44:50] releng should appove etc..
[14:44:54] *approve
[14:45:11] the trusted runners are the only gitlab runners able to push to the registry afaik
[14:45:23] so we control what's being pushed etc..
[15:06:49] ah, the cost of doing everything out in the open.
[15:44:25] hm, still blocked https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/jobs/489377
[15:54:01] it seems that there is a step that releng needs to do to allow zarcillo to use the trusted runners (after the merge) - https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/tree/main?ref_type=heads#request-access-to-trusted-runners
[15:54:05] was it done?
[16:25:05] (I don't know)
[16:49:39] Yes, that happened in https://gitlab.wikimedia.org/repos/releng/gitlab-trusted-runner/-/jobs/489366
[16:49:58] which completed about an hour and 10 minutes ago.
[16:51:41] If there was a job that attempted to publish before that, it won't work... and retrying it won't work. A new job (and therefore a new pipeline) needs to run.
[16:53:01] federico3: ^^
[16:54:25] dancy: restarting the CI job is not enough?
[16:54:38] No.
[16:55:05] But feel free to try and confirm for yourself
[16:58:07] I pushed a new commit and it seems it's not finding a runner yet https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/jobs/489512
[16:58:28] Hmm... taking a look
[16:59:00] does the branch need to be protected from writes to enable the workers?
[16:59:12] The branch does need to be protected.
[17:01:46] btw you should be able to create a new pipeline without making a code change by using the blue "New pipeline" button at https://gitlab.wikimedia.org/repos/data_persistence/zarcillo/-/pipelines
[17:08:41] ah, thanks
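
Note on the debugging step used in this log: the helm timeout turned out to be an image pull failure that was only visible in the Kubernetes events (`kubectl get events`), not in the helm output. A minimal sketch of how one might pull just those failures out of the JSON form of the events follows. It assumes the standard core/v1 Event fields (`type`, `message`, `involvedObject`) as produced by `kubectl get events -o json`; the function name `failed_image_pulls` and the `SAMPLE` data are hypothetical and only for illustration, with the sample message shortened from the one pasted at 13:58:56.

```python
def failed_image_pulls(event_list: dict) -> list[str]:
    """Return 'kind/name: message' for Warning events about failed image pulls.

    `event_list` is assumed to be the parsed output of
    `kubectl get events -o json` (a List object with an `items` array).
    """
    hits = []
    for event in event_list.get("items", []):
        # Only Warning events are of interest here; Normal events are skipped.
        if event.get("type") != "Warning":
            continue
        message = event.get("message", "")
        if "Failed to pull image" not in message:
            continue
        obj = event.get("involvedObject", {})
        kind = obj.get("kind", "?").lower()
        name = obj.get("name", "?")
        hits.append(f"{kind}/{name}: {message}")
    return hits


# Hypothetical, trimmed sample in the shape of `kubectl get events -o json`.
SAMPLE = {
    "items": [
        {
            "type": "Normal",
            "reason": "Scheduled",
            "message": "Successfully assigned pod to a node",
            "involvedObject": {"kind": "Pod", "name": "zarcillo-main-6c967ccdbb-6wd99"},
        },
        {
            "type": "Warning",
            "reason": "Failed",
            "message": 'Failed to pull image "docker-registry.discovery.wmnet/'
                       'wikimedia/zarcillo:2025-03-25-091801-production": not found',
            "involvedObject": {"kind": "Pod", "name": "zarcillo-main-6c967ccdbb-6wd99"},
        },
    ]
}

if __name__ == "__main__":
    for line in failed_image_pulls(SAMPLE):
        print(line)
```

On a deployment host the input would come from something like `kubectl get events -o json` after `kube_env zarcillo aux-k8s-eqiad`, parsed with `json.load`; that wiring is left out here so the sketch runs without cluster access.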