[11:37:57] I have several k8s worker nodes stuck on image downloads in codfw; anything I could look at registry-side to see what's going on?
[11:38:58] <_joe_> https://wikitech.wikimedia.org/wiki/Docker-registry/Runbook looks like a good starting point
[11:41:09] Mh. What hosts does the registry run on?
[11:41:26] ah registryXXXX
[11:42:30] <_joe_> klausman: what images are you failing to download?
[11:42:43] docker-registry.discovery.wmnet/wikimedia/machinelearning-liftwing-inference-services-llm@sha256:fb1af91f38359ddb82da04fae3df5a21e7d8b091e203414809e865eb95eb6d73 is one of them
[11:43:06] 1a2a93ab496b: Downloading [==========> ] 446MB/2.038GB
[11:43:39] The host is just sitting there. I have seen timeouts as well (pod has state ErrImagePull or ImagePullBackOff)
[11:45:40] It seems to occasionally make progress, but it's extremely slow: taking O(45m) for 550MB so far, then a spurt to 619MB, and it hangs again
[11:46:53] <_joe_> I would assume it has something to do with layers being larger than the in-memory cache of nginx
[11:47:05] <_joe_> check the memory usage on the servers
[11:48:51] Both reg2003 and reg2004 have several GB of buffers/cache in use, so I don't think they're under memory pressure
[11:49:37] I vaguely remember k8s using Dragonfly these days; wouldn't that bypass nginx?
[11:49:55] <_joe_> 1) no 2) it's only used for mediawiki images IIRC
[11:50:03] Ah, I see
[11:50:04] <_joe_> jayme: ^^ is that still the case?
[11:50:52] <_joe_> but yeah, I'd rather look at swift to ensure things are ok first
[11:51:00] <_joe_> then bubble up to the registry
[11:52:59] Oh, a whole pile of server errors in Grafana
[11:53:11] account/GET and container/GET
[11:54:00] Starting around 10:00 UTC, which is vaguely correlated with me doing rolling drains/restarts of our k8s nodes for runc updates
[11:54:27] And ATS has been sending 500s to Swift?
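As an aside, the pull rate observed above (roughly 550 MB in about 45 minutes) can be turned into a concrete number to show how pathological it is. A quick back-of-envelope sketch (the figures are taken from the chat; the projection assumes the rate stays constant):

```shell
# Observed: ~550 MB downloaded in ~45 minutes for one layer.
rate=$(awk 'BEGIN { printf "%.2f", 550 / (45 * 60) }')
echo "Observed rate: ${rate} MB/s"

# At that rate, the full 2.038 GB layer would take hours to finish.
eta=$(awk 'BEGIN { printf "%.1f", 2038 / (550 / (45 * 60)) / 3600 }')
echo "Projected time for the 2.038 GB layer: ${eta} h"
```

About 0.2 MB/s is orders of magnitude below what a datacenter-local registry pull should sustain, which is why this points at the registry/Swift path rather than at ordinary load.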
But I don't know the meaning of that
[12:09:44] klausman: that's not how big O notation works, glad to be helpful, you're welcome :D
[12:10:07] I know, but it's still shorter to type than "on the order of"
[12:10:20] * kamila_ is channeling their inner kormat and can be safely ignored
[12:15:36] kamila_: tihi
[12:16:53] claime: you of all people? :D
[12:23:33] _joe_: yes and no. Dragonfly is used regardless of the image, but it's only deployed on wikikube
[14:52:10] Do pods need a network policy to be able to reach out to other pods? I have a service that can't reach out to a memcached port (either via the service cluster IP or the actual pod IP), and I'm trying to understand why. Thanks!
[14:54:03] *reach out to a memcached pod
[15:14:04] brouberol: iirc they do yeah
[15:20:09] brouberol: actually they don't. There is a dreaded, historic pod-to-pod network policy (grep in admin_ng) that allows Egress to the pod networks
[15:20:18] you'd still have to allow ingress, though
[15:21:03] but I would argue that it would be nice to be explicit in the network policy and not rely on pod-to-pod... we should get rid of that rule
[15:26:37] jayme: once again, I owe you. I was missing an ingress network policy
[15:43:48] yw ;)
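The missing piece in the exchange above was an ingress rule on the memcached side: the historic pod-to-pod policy already allowed egress, but nothing permitted traffic *into* the memcached pods. A minimal sketch of the kind of NetworkPolicy that fixes this; the names, labels, and namespace are hypothetical (not the actual manifests discussed here), and 11211 is memcached's default port:

```yaml
# Hypothetical example; labels and namespace are illustrative only.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-memcached-ingress
  namespace: example-namespace
spec:
  # Applies to the memcached pods that need to accept connections.
  podSelector:
    matchLabels:
      app: memcached
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Only the client service's pods are allowed in.
        - podSelector:
            matchLabels:
              app: example-client
      ports:
        - protocol: TCP
          port: 11211   # memcached default port
```

Once any NetworkPolicy selects a pod for Ingress, all ingress not explicitly allowed is denied, which is why the connection failed over both the service cluster IP and the pod IP until a policy like this existed.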