[07:16:23] good morning folks
[07:17:01] If nobody opposes I'll roll out the new changeprop (nodejs20 bump, librdkafka minor bump) in codfw
[07:17:50] the new version was tested by Aaron locally and in Beta, and of course in k8s staging
[07:22:00] good luck!
[07:33:53] akosiaris: ahahaha thanks for the reassuring +!
[07:33:54] +1
[07:34:17] ;-)
[07:51:23] <_joe_> elukey: to clarify: are you doing changeprop first, then the jobqueue?
[07:54:12] _joe_: exactly yes, this is my idea, after people finish deploying
[07:54:39] but with a slow pace, there is really no rush, ideally I'll leave changeprop-codfw for some hours before doing jobqueue
[07:54:46] I can even do it tomorrow
[07:54:54] <_joe_> no i mean
[07:55:15] <_joe_> let's do all of changeprop today, move to jobqueue in a couple days?
[07:55:24] <_joe_> just to see if something burns
[07:56:23] ah ok sorry, you mean changeprop codfw,eqiad today, let it bake and then jobqueue tomorrow/wed
[07:56:30] okok yes +1
[08:12:48] oncall people: A:cp-magru hosts are now using TLS certificates from volatile storage
[08:13:04] fabfur: Thanks
[09:21:55] Hi elukey and fabfur
[09:22:06] hullo!
[09:22:11] I have found the issue with the webrequest-source value in the live sampled data
[09:22:21] I'll submit a patch in minutes
[09:22:22] thanks!
[09:22:23] :)
[09:23:02] nice
[09:23:07] joal: <3
[09:23:25] let us know; if I have to adapt the dashboard I will ofc, but even better if not needed :)
[09:27:17] No change on your side, just some data that was incorrectly parsed
[09:27:43] ack, thx
[09:55:34] I confirm the fix is being rolled out and works as expected: https://w.wiki/DeYM
[09:56:26] <_joe_> joal: <3
[09:56:40] 👍
[10:03:13] joal: nice, what is frontend there? used to be just upload and text
[10:03:17] is it the sum?
[10:18:34] oh I see we're back with just upload and text, probably just some overlapping time when all three metrics were there
[10:50:31] for on-callers' awareness: mw deployments are currently stuck because of https://phabricator.wikimedia.org/T390251
[10:50:58] Alex and I checked this morning and it seems nginx-related
[10:51:13] now, I filed https://gerrit.wikimedia.org/r/c/operations/puppet/+/1132582 which I think may be a good lead to improve the situation
[10:51:34] I don't really think the 10g layer cache matters that much performance-wise for the registry
[10:51:59] since we have Dragonfly for the blob-cache heavy lifting
[10:52:49] lemme know your thoughts
[10:55:17] I've set https://phabricator.wikimedia.org/T390251 even if it should probably be Unbreak Now
[10:55:22] *to high
[11:56:04] disabled the nginx blob cache for all the registry hosts, so far the inconsistency seems gone
[11:56:10] but another deployment will tell
[11:56:22] getting lunch now :)
[12:20:32] fy, david just confirmed in -operations that eluke.y's patch worked.
[12:20:36] fyi*
[12:33:54] 👍
[13:11:39] <_joe_> uhm wasn't daniel supposed to be overriding for arnold?
[13:25:28] It's done weird. It will switch to Daniel in 30 minutes.
[13:25:52] 35 minutes.
[13:55:16] on-callers / service-ops - to complete the docker registry task mentioned earlier on, I'd rm -rf /var/cache/nginx-docker-registry on registry*, so we clean up the corrupted content etc. (and we get more space on the root partition)
[13:55:23] thoughts?
[13:55:40] (the directory is not referenced in the nginx config anymore)
[14:38:54] <_joe_> elukey: go
[14:39:08] <_joe_> it's a cache, sorry I was in a meeting :/
[15:02:21] done!
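For context on the registry change discussed above, here is a minimal sketch of an nginx vhost in front of a Docker registry with the on-disk blob cache left disabled. It is only a sketch: the server name, backend port, certificate paths, and cache path are placeholders, not the actual puppet-managed template behind the gerrit change.

    # Sketch of an nginx vhost fronting a Docker registry, blob cache disabled.
    # All names, paths, and ports here are illustrative.
    #
    # The zone that used to back the 10g blob cache would be declared roughly as:
    # proxy_cache_path /var/cache/nginx-docker-registry levels=1:2
    #                  keys_zone=registry_blobs:16m max_size=10g inactive=7d;

    server {
        listen 443 ssl;
        server_name docker-registry.example.wmnet;           # placeholder

        ssl_certificate     /etc/ssl/certs/registry.crt;     # placeholder
        ssl_certificate_key /etc/ssl/private/registry.key;   # placeholder

        # Blob uploads can be arbitrarily large, so don't cap the request body.
        client_max_body_size 0;

        location /v2/ {
            proxy_pass http://127.0.0.1:5000;                # registry backend
            proxy_set_header Host              $http_host;
            proxy_set_header X-Forwarded-Proto $scheme;
            # Blob caching left off on purpose: Dragonfly already does the
            # heavy lifting of distributing blobs to the k8s nodes.
            # proxy_cache registry_blobs;
        }
    }

Dropping the cache trades a little extra backend load for having one less place where a blob can go stale or get corrupted.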
[18:18:18] I've run `depool ` twice on elastic1068, but it still shows up in https://config-master.wikimedia.org/pybal/eqiad/search-https and i still get connections to it when fetching https://search.svc.eqiad.wmnet/, why might that be?
[18:22:43] * ebernhardson realizes it's because i wanted `depool 'all services'`
[18:25:47] ebernhardson: and it has a parameter to specify the systemd service name, right? is that what you used then?
[18:31:27] mutante: yea, i suppose imo there are two problems: 1) when run as not-root it doesn't emit any error messages, and 2) when specifying a service that doesn't exist, again no error messages
[18:33:40] ebernhardson: gotcha, yea. the other alternative is to run conftool depool --hostname .. from a central place instead of the pool/depool command on the host itself
[18:34:23] conftool depool --hostname .. --service ...
[18:37:12] mutante: the one difference between "depool from host" vs central place (which is where SREs do it, puppetserver or cumin) is that non-SREs don't have access to those but can run the depool script from the host locally, assuming they have access within that host
[18:38:17] yea, for me i have sudo to root on elastic*, but not in general
[18:39:24] ACK, I see. yea
[18:39:51] ebernhardson: but you are good for now with the actual depool then?
[18:40:09] yea once i realized how to make it work :) `sudo depool` did the trick
[18:40:17] ok
[18:40:44] i suppose i use it so rarely, i was thinking it worked like the puppet disable script with a reason
[19:26:42] Hi SRE folks! We've run into the registry problem again: https://phabricator.wikimedia.org/T390251
[19:27:24] Deployments are blocked in the meantime.
[19:30:44] taking a look - this is immensely puzzling
[19:32:48] does this mean we should do an nginx restart for the short-term unblock?
[19:33:08] so, the cache should now be disabled
[19:33:51] I'm going to pull the blob from both registry2004 and 2005, for each of multiversion and multiversion-debug
[19:37:33] swfrench-wmf: dancy: serverfault has an "unknown blob error in private docker registry under nginx reverse proxy" question
[19:37:39] specifically with nginx
[19:38:45] no good answers but yea, they say it all worked fine until they moved their registry behind nginx
[19:40:14] there is more! https://github.com/distribution/distribution/issues/2746
[19:40:42] could it be client_max_body_size
[19:42:33] very interesting!
[19:42:54] so, unfortunately(?) all of these blobs appear to be "good" now: https://phabricator.wikimedia.org/T390251#10695555
[19:43:38] Upsetting.
[19:44:41] I'll restart the sync and see what happens.
[19:45:08] fwiw: client_max_body_size is already set to 0 in the registry config
[19:46:49] https://nishtahir.com/fixing-the-unknown-blob-error-apache-private-docker/
[19:47:41] mutante: thank you for checking
[19:49:04] btw did we stop using that dragonfly image distribution stuff?
[19:49:20] dragonfly should still be enabled
[19:49:57] Where does it fit into the mix with nginx and the registry?
[19:51:07] my understanding is that it sits "between" (so to speak) the k8s nodes and the registry
[19:51:22] and by registry, I mean nginx
[19:51:31] nod. Seems like it may be something to consider blaming.
[19:52:00] or suspecting
[19:52:03] true, though IIRC you were able to pull a bad blob with curl last week, right?
[19:52:14] Indeed. Good point.
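A rough illustration of how a suspect blob can be checked by hand, in the spirit of the curl pulls mentioned above: registry blobs are content-addressed, so whatever the frontend serves for /v2/<name>/blobs/<digest> should hash back to that digest. The host, repository, and digest below are placeholders, not the real ones.

    # Hypothetical integrity check of one blob served by the registry frontend
    # (assumes anonymous pulls are allowed); host, repo, and digest are placeholders.
    REG=https://docker-registry.example.wmnet
    REPO=example/some-image
    DIGEST=sha256:0000000000000000000000000000000000000000000000000000000000000000

    # The body of /v2/<name>/blobs/<digest> must hash back to <digest>;
    # a mismatch means this frontend is serving a corrupt copy.
    curl -sfL "$REG/v2/$REPO/blobs/$DIGEST" | sha256sum

Repeating the same fetch against each backend host helps narrow down whether the corruption comes from the registry itself, the nginx layer in front of it, or something in between.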
[19:56:10] if there was some bug where intermittently it sets the wrong $scheme in the X-Forwarded-Proto header.. and speaks http instead of https, that could explain how it's garbled completely.. and others claim hard-setting it to "X-Forwarded-Proto https;" fixed it for them
[19:56:25] our nginx template here has: proxy_set_header X-Forwarded-Proto $scheme;
[20:16:58] I've checked the docker-registry logs on registry2004 and 2005, and this all looks entirely normal to me ... unlike the instance of this last Thursday, there are no upload errors
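If the X-Forwarded-Proto theory were ever tested, the workaround described in those reports amounts to a one-line change: stop forwarding the dynamic $scheme and hard-code the value, which is only safe on a vhost reachable exclusively over TLS. A sketch of the difference:

    # As quoted from the template above: forward whatever scheme nginx saw.
    #     proxy_set_header X-Forwarded-Proto $scheme;
    #
    # Workaround suggested in the linked reports: hard-code the scheme
    # (only safe if this vhost is served over TLS only).
    proxy_set_header X-Forwarded-Proto https;

The registry can use X-Forwarded-Proto when it builds redirect and upload URLs, so a wrong scheme seems more likely to produce broken URLs than corrupt blob contents, but that is speculation rather than anything confirmed in the logs above.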