[11:37:57] I have several k8s worker nodes stuck on image downloads in codfw; anything I could look at registry-side to see what's going on?
[11:38:58] <_joe_> https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook looks like a good starting point
[11:41:09] Mh. What hosts does the registry run on?
[11:41:26] ah registryXXXX
[11:42:30] <_joe_> klausman: what images are you failing to download?
[11:42:43] docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-llm@sha256:fb1af91f38359ddb82da04fae3df5a21e7d8b091e203414809e865eb95eb6d73 is one of them
[11:43:06] 1a2a93ab496b: Downloading [==========> ] 446MB/2.038GB
[11:43:39] The host is just sitting there. I have seen timeouts as well (pod has state ErrImagePull or ImagePullBackOff)
[11:45:40] It seems to occasionally make progress, but it's extremely slow: taking O(45m) for 550MB so far, then a spurt to 619MB, and it hangs again
[11:46:53] <_joe_> I would assume it has something to do with layers being larger than the in-memory cache of nginx
[11:47:05] <_joe_> check the memory usage on the servers
[11:48:51] Both reg2003 and reg2004 have several GB of buffers/cache in use, so I don't think they're under memory pressure
[11:49:37] I vaguely remember k8s using Dragonfly these days; wouldn't that bypass nginx?
[11:49:55] <_joe_> 1) no 2) it's only used for mediawiki images IIRC
[11:50:03] Ah, I see
[11:50:04] <_joe_> jayme: ^^ is that still the case?
[11:50:52] <_joe_> but yeah, I'd rather look at swift to ensure things are ok first
[11:51:00] <_joe_> then bubble up to the registry
[11:52:59] Oh, a whole pile of server errors in Grafana
[11:53:11] account/GET and container/GET
[11:54:00] Starting around 10:00 UTC, which is vaguely correlated with me doing rolling drains/restarts of our k8s nodes for runc updates
[11:54:27] And ATS has been sending 500s to Swift?
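As an aside, the pull rate observed above (roughly 550 MB in about 45 minutes) can be turned into a concrete number to show how pathological it is. A quick back-of-envelope sketch (the figures are taken from the chat; the projection assumes the rate stays constant):

```shell
# Observed: ~550 MB downloaded in ~45 minutes for one layer.
rate=$(awk 'BEGIN { printf "%.2f", 550 / (45 * 60) }')
echo "Observed rate: ${rate} MB/s"

# At that rate, the full 2.038 GB layer would take hours to finish.
eta=$(awk 'BEGIN { printf "%.1f", 2038 / (550 / (45 * 60)) / 3600 }')
echo "Projected time for the 2.038 GB layer: ${eta} h"
```

About 0.2 MB/s is orders of magnitude below what a datacenter-local registry pull should sustain, which is why this points at the registry/Swift path rather than at ordinary load.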
But I don't know the meaning of that
[12:09:44] klausman: that's not how big O notation works, glad to be helpful, you're welcome :D
[12:10:07] I know, but it's still shorter to type than "on the order of"
[12:10:20] * kamila_ is channeling their inner kormat and can be safely ignored
[12:15:36] kamila_: tihi
[12:16:53] claime: you of all people? :D
[12:23:33] _joe_: yes and no. Dragonfly is used regardless of the image, but it's only deployed on wikikube
[14:52:10] Do pods need a network policy to be able to reach out to other pods? I have a service that can't reach out to a memcached port (either via the service cluster IP or the actual pod IP), and I'm trying to understand why. Thanks!
[14:54:03] *reach out to a memcached pod
[15:14:04] brouberol: iirc they do yeah
[15:20:09] brouberol: actually they don't. There is a dreaded, historic pod-to-pod network policy (grep in admin_ng) that allows Egress to the pod networks
[15:20:18] you'd still have to allow ingress, though
[15:21:03] but I would argue that it would be nice to be explicit in the network policy and not rely on pod-to-pod... we should get rid of that rule
[15:26:37] jayme: once again, I owe you. I was missing an ingress network policy
[15:43:48] yw ;)
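The missing piece in the exchange above was an ingress rule on the memcached side: the historic pod-to-pod policy already allowed egress, but nothing permitted traffic *into* the memcached pods. A minimal sketch of the kind of NetworkPolicy that fixes this; the names, labels, and namespace are hypothetical (not the actual manifests discussed here), and 11211 is memcached's default port:

```yaml
# Hypothetical example; labels and namespace are illustrative only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-memcached-ingress
  namespace: example-namespace
spec:
  # Applies to the memcached pods that need to accept connections.
  podSelector:
    matchLabels:
      app: memcached
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Only the client service's pods are allowed in.
        - podSelector:
            matchLabels:
              app: example-client
      ports:
        - protocol: TCP
          port: 11211   # memcached default port
```

Once any NetworkPolicy selects a pod for Ingress, all ingress not explicitly allowed is denied, which is why the connection failed over both the service cluster IP and the pod IP until a policy like this existed.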