[10:03:54] refined the patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1215098
[10:04:12] so now it should be possible to use ingress in kartotherian staging, and direct traffic to it
[10:04:20] lemme know if it makes sense
[15:46:38] Hi. I've got another one of these persistent `blob upload invalid` errors from the docker-registry: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693588
[15:47:14] Is there some action I can take to try to mitigate it?
[15:50:25] btullis: lemme check on the registry side, but the first thing that comes to mind is to verify how big the docker layers are (of that image)
[15:52:31] are those the spark base images?
[15:53:37] Thanks. It's been working up to yesterday. These are the new spark images, built by GitLab-CI instead of production-images.
[15:53:47] https://docker-registry.wikimedia.org/repos/data-engineering/spark/tags/
[15:55:12] It looks like the biggest layer of the most recent one to be published is just over 1GB. I can't think of anything I added in the two recent MRs (the ones that didn't publish) that might have increased the layer size.
[15:55:42] *increased the layer size dramatically.
[15:59:18] okok perfect, then it seems to be an eventual consistency issue on swift
[15:59:30] so I can only see this for registry2005
[15:59:31] Dec 04 14:44:27 registry2005 docker-registry[608]: time="2025-12-04T14:44:27.084753815Z" level=info msg="response completed" go.version=go1.19.8 http.request.contenttype=application/vnd.docker.distribution.manifest.v2+json http.request.host=docker-registry.discovery.wmnet http.request.id=04cc3834-3599-416a-a5dd-91a4c45ec224 http.request.method=PUT http.request.remoteaddr=10.192.29.6
[15:59:31] http.request.uri=/v2/repos/data-engineering/spark/spark3.4-history/manifests/2025-12-04-144417-b013c7f45971a3ba48c17ddc3d326533599b18cf http.request.useragent=buildkit/v0.26 http.response.duration=308.890026ms http.response.status=201 http.response.written=0
[16:00:11] that should match (remote addr gitlab-runner2003.codfw.wmnet)
[16:00:47] IIRC we had this issue with the mediawiki images rebuild, when the fetch of the new layers/metadata happened too fast (so swift wasn't fully consistent yet)
[16:01:16] btullis: have you seen more of these recently? What happens if you kick off another run?
[16:02:28] I have seen two: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693588 and previously: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693467
[16:03:29] I merged a patch between the two jobs, but I don't know if it changed the problematic layer. I'm happy to kick off a rerun now, if you like.
[16:03:30] I see that you also posted to https://phabricator.wikimedia.org/T406392, I wasn't aware of it
[16:03:40] btullis: let's try, I am curious
[16:03:53] Ack.
[16:04:28] https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693618
[16:06:43] not much luck
[16:06:56] It failed very quickly, so there is no way that it compiled spark. It must have used a cached layer.
[16:09:38] yeah I agree
[16:09:59] so my theory is that the cached layer is still not consistent across all swift nodes
[16:10:21] and it may resolve itself after some time passes, like https://phabricator.wikimedia.org/T406392#11252254
[16:10:28] that is not a great strategy, I know
[16:10:41] we should really move the registry to apus and drop swift
[16:11:22] OK, FYI there was a 6-hour gap between the first two failed runs.
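A minimal sketch of how the two checks discussed above (layer sizes of a published tag, and whether every referenced layer blob is actually retrievable, which is what the swift eventual-consistency theory predicts would fail) could be run against the registry's standard Docker Registry HTTP API v2. The repository and tag values are illustrative, and it assumes anonymous read access and that the tag resolves to a plain single-arch v2 manifest rather than a manifest list:

```python
#!/usr/bin/env python3
"""Inspect layer sizes for a tag on docker-registry.wikimedia.org and
HEAD each layer blob to see whether the registry can currently serve it."""
import requests

REGISTRY = "https://docker-registry.wikimedia.org"
# Illustrative values; substitute the repo/tag under investigation.
REPO = "repos/data-engineering/spark/spark3.4-history"
TAG = "2025-12-04-144417-b013c7f45971a3ba48c17ddc3d326533599b18cf"

MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def main() -> None:
    # Fetch the v2 manifest to get per-layer digests and sizes.
    resp = requests.get(
        f"{REGISTRY}/v2/{REPO}/manifests/{TAG}",
        headers={"Accept": MANIFEST_V2},
        timeout=30,
    )
    resp.raise_for_status()
    manifest = resp.json()

    for layer in manifest.get("layers", []):
        digest = layer["digest"]
        size_mib = layer["size"] / (1024 * 1024)
        # A 404/5xx here on a layer the manifest references would fit the
        # "swift not yet consistent" theory from the chat above.
        head = requests.head(
            f"{REGISTRY}/v2/{REPO}/blobs/{digest}",
            allow_redirects=True,
            timeout=30,
        )
        print(f"{digest}  {size_mib:8.1f} MiB  HTTP {head.status_code}")


if __name__ == "__main__":
    main()
```

If the blob HEADs succeed on a later run without any new push, that would support the theory that the failure clears on its own once the backend catches up.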
[16:28:36] posted a msg to both tasks, I really think that we need to form a working group to transition us away from swift for good