[10:03:54] refined the patch https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1215098
[10:04:12] so now it should be possible to use ingress in kartotherian staging, and direct traffic to it
[10:04:20] lemme know if it makes sense
[15:46:38] Hi. I've got another one of these persistent `blob upload invalid` errors from the docker-registry: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693588
[15:47:14] Is there some action I can take to try to mitigate it?
[15:50:25] btullis: lemme check on the registry side, but the first thing that comes to mind is to verify how big the docker layers are (of that image)
[15:52:31] are those the spark base images?
[15:53:37] Thanks. It's been working up to yesterday. These are the new spark images, built by GitLab-CI instead of production-images.
[15:53:47] https://docker-registry.wikimedia.org/repos/data-engineering/spark/tags/
[15:55:12] It looks like the biggest layer of the most recent one to be published is just over 1GB. I can't think of anything I added in the two recent MRs (the ones that didn't publish) that might have increased the layer size.
[15:55:42] *increased the layer size dramatically.
[15:59:18] okok perfect, then it seems to be an eventual consistency issue on swift
[15:59:30] so I can only see this for registry2005
[15:59:31] Dec 04 14:44:27 registry2005 docker-registry[608]: time="2025-12-04T14:44:27.084753815Z" level=info msg="response completed" go.version=go1.19.8 http.request.contenttype=application/vnd.docker.distribution.manifest.v2+json http.request.host=docker-registry.discovery.wmnet http.request.id=04cc3834-3599-416a-a5dd-91a4c45ec224 http.request.method=PUT http.request.remoteaddr=10.192.29.6
[15:59:31] http.request.uri=/v2/repos/data-engineering/spark/spark3.4-history/manifests/2025-12-04-144417-b013c7f45971a3ba48c17ddc3d326533599b18cf http.request.useragent=buildkit/v0.26 http.response.duration=308.890026ms http.response.status=201 http.response.written=0
[16:00:11] that should match (remote addr gitlab-runner2003.codfw.wmnet)
[16:00:47] IIRC we had this issue with the mediawiki images rebuild, when the fetch of the new layers/metadata happened too fast (so swift wasn't fully consistent yet)
[16:01:16] btullis: have you seen more of these recently? What happens if you kick off another run?
[16:02:28] I have seen two: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693588 and previously: https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693467
[16:03:29] I merged a patch between the two jobs, but I don't know if it changed the problematic layer. I'm happy to kick off a rerun now, if you like.
[16:03:30] I see that you also posted to https://phabricator.wikimedia.org/T406392, I wasn't aware of it
[16:03:40] btullis: let's try, I am curious
[16:03:53] Ack.
[16:04:28] https://gitlab.wikimedia.org/repos/data-engineering/spark/-/jobs/693618
[16:06:43] not much luck
[16:06:56] It failed very quickly, so there is no way that it compiled spark. It must have used a cached layer.
[16:09:38] yeah I agree
[16:09:59] so my theory is that the cached layer is still not consistent across all swift nodes
[16:10:21] and it may resolve itself after some time passes, like https://phabricator.wikimedia.org/T406392#11252254
[16:10:28] that is not a great strategy, I know
[16:10:41] we should really move the registry to apus and drop swift
[16:11:22] OK, FYI there was a 6-hour gap between the first two failed runs.
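A minimal sketch of how the two checks discussed above (layer sizes of a published tag, and whether every referenced layer blob is actually retrievable, which is what the swift eventual-consistency theory predicts would fail) could be run against the registry's standard Docker Registry HTTP API v2. The repository and tag values are illustrative, and it assumes anonymous read access and that the tag resolves to a plain single-arch v2 manifest rather than a manifest list:

```python
#!/usr/bin/env python3
"""Inspect layer sizes for a tag on docker-registry.wikimedia.org and
HEAD each layer blob to see whether the registry can currently serve it."""
import requests

REGISTRY = "https://docker-registry.wikimedia.org"
# Illustrative values; substitute the repo/tag under investigation.
REPO = "repos/data-engineering/spark/spark3.4-history"
TAG = "2025-12-04-144417-b013c7f45971a3ba48c17ddc3d326533599b18cf"

MANIFEST_V2 = "application/vnd.docker.distribution.manifest.v2+json"


def main() -> None:
    # Fetch the v2 manifest to get per-layer digests and sizes.
    resp = requests.get(
        f"{REGISTRY}/v2/{REPO}/manifests/{TAG}",
        headers={"Accept": MANIFEST_V2},
        timeout=30,
    )
    resp.raise_for_status()
    manifest = resp.json()

    for layer in manifest.get("layers", []):
        digest = layer["digest"]
        size_mib = layer["size"] / (1024 * 1024)
        # A 404/5xx here on a layer the manifest references would fit the
        # "swift not yet consistent" theory from the chat above.
        head = requests.head(
            f"{REGISTRY}/v2/{REPO}/blobs/{digest}",
            allow_redirects=True,
            timeout=30,
        )
        print(f"{digest}  {size_mib:8.1f} MiB  HTTP {head.status_code}")


if __name__ == "__main__":
    main()
```

If the blob HEADs succeed on a later run without any new push, that would support the theory that the failure clears on its own once the backend catches up.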
[16:28:36] posted a msg to both tasks, I really think that we need to form a working group to transition us away from swift for good