[07:52:50] sigh, the docker registry issue is not resolved :(
[08:07:15] one thing that I didn't get from https://phabricator.wikimedia.org/T390251#10695555 and https://phabricator.wikimedia.org/T390251#10695479 though is whether the error now surfaces only when doing a docker pull, rather than a curl (as in the original report)
[08:07:33] because in the former case it could be an issue with Dragonfly as well
[08:07:57] the image gets uploaded, and for some reason it takes a bit before it is correctly pullable
[08:08:03] but why now? What changed?
[08:09:35] as Scott says, "This is a Heisenbug of the most irritating kind"
[08:10:01] I'll do a scap deploy in 10 minutes or so to see if I can reproduce
[08:10:46] I am thinking of adding debug logging to nginx, what do you think?
[08:10:57] it should be manageable, and maybe something good pops up
[08:11:14] go ahead
[08:11:19] I'll wait for it
[08:16:20] I tried it on registry1004, the error log gets filled with a lot of data, we'll have to monitor the growth of the logfiles
[08:16:25] sending the patch
[08:18:42] akosiaris: there are also all the reports added by Daniel in https://phabricator.wikimedia.org/T390251#10695580 - the errors are different, but it wouldn't hurt to just flip $scheme to "https" in our configs since it is pretty static
[08:19:33] it sounds very improbable that this is the reason
[08:19:53] oh yes, I agree, but if you read the docker distribution reports you get very sad
[08:20:10] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133069 is ready for a scap backport fwiw
[08:20:11] they report problems while pushing, but how the registry behaves is..
[08:20:14] as noop as it gets
[08:20:21] lemme send the debug patch
[08:21:07] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133070
[08:21:39] akosiaris: is the patch going to push another image to the registry? Pretty ignorant about the scap workflow
[08:21:52] yes
[08:22:08] I need to RT*M the whole workflow :D
[08:22:24] heh, if only that was easy
[08:22:44] elukey: how long do we plan on having debug on?
[08:23:03] maybe a couple of days? Just to catch another occurrence of the issue
[08:23:11] ok then. Lemme +1
[08:26:26] running puppet
[08:27:55] In the meantime I am trying to figure out how big docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-03-31-190719-publish-81 is
[08:28:04] there must be a trigger that wasn't there before
[08:30:06] akosiaris: nginx ready
[08:45:31] elukey: sorry, emergency again :-(. I'll deploy once I am back
[08:50:58] sure np
[09:33:30] Kid #2 with fever in school...
[09:34:48] they start at 0?
[09:34:53] * fabfur hides
[09:36:26] It's a UBN right? I can take over deploying some stuff if needed while akosiaris takes care of their germ factory :P
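A minimal sketch of the curl-side check raised at 08:07 above, for comparing against a docker pull: it fetches one blob straight from the registry and verifies that the bytes match the digest in its name. The image, registry host and digest are taken from log lines quoted later in this conversation; network access and any credentials the restricted/ namespace requires are assumed, and this is not the exact command anyone ran.

```bash
#!/bin/bash
# Sketch: pull one blob directly from the registry via the v2 API and compare
# its sha256 against the digest in its name. The digest below is the one from
# the 10:11 nginx error line; swap in whatever blob is failing.
REGISTRY="https://docker-registry.discovery.wmnet"
REPO="restricted/mediawiki-multiversion-debug"
DIGEST="sha256:128b91e8163d40642d2bdd410f8544bee05ee9cb6a28190d0eca8a79f5bd2e8c"

# (the restricted/ namespace needs auth; add credentials or test with a public image)
curl -sfL -o /tmp/blob "${REGISTRY}/v2/${REPO}/blobs/${DIGEST}"
echo "expected: ${DIGEST#sha256:}"
echo "got:      $(sha256sum /tmp/blob | awk '{print $1}')"
```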
[09:38:14] elukey: tell me if I can help, anyways
[09:40:06] I'll be in front of my computer in 10
[09:40:16] Germ factory taken care of
[09:46:10] :)
[09:46:35] claime: it is still unclear if it is a UBN, because somehow the bug disappears after a bit
[09:46:43] I didn't catch whether they were able to deploy yesterday
[09:46:44] augh
[09:46:54] Some deploys ran yesterday
[09:47:22] one thing that I am wondering is how images like docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug are built
[09:47:27] because we haven't seen this in the past
[09:47:36] so maybe something changed
[09:48:23] They're built by https://gitlab.wikimedia.org/repos/releng/release/-/tree/main/make-container-image?ref_type=heads
[09:48:53] I am around now
[09:49:04] let me see if I'll manage to backport successfully
[09:49:09] akosiaris: wait a sec
[09:49:21] I am trying to figure out if nginx uses debug now
[09:49:27] * akosiaris waiting
[09:49:34] because I don't see it in the error logs, it may require a full restart?
[09:50:48] yes sigh
[09:51:23] For debug logging to work, nginx needs to be built with --with-debug, see “A debugging log”.
[09:51:24] ?
[09:51:36] oh, it worked?
[09:51:38] ok
[09:51:57] works now
[09:52:02] go ahead :)
[09:52:48] it is built with --with-debug btw
[09:52:56] clearly, just double-checked with nginx -V
[09:53:30] the error_log is now very verbose :)
[09:54:08] starting
[09:55:06] !log scap backport a noop change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1133069 for T390251
[09:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:08] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251
[09:57:10] 09:55:57 K8s images build/push output redirected to /home/akosiaris/scap-image-build-and-push-log
[09:57:20] 09:57:10 Started sync-testservers-k8s
[09:58:44] image being pulled, /me waiting
[09:59:40] with our luck it will not surface
[10:01:00] I don't think it will tbh
[10:01:31] it has already pulled successfully on a few nodes
[10:01:46] wikikube-worker2050.codfw.wmnet Successfully pulled image "docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2025-04-01-095557-publish-81" in 3m31.999853508s 0s 1 mw-debug.codfw.next-867f965c4-xqgzc.18322985f13fb759
[10:02:29] sync-testservers-k8s step already at 50%
[10:03:50] 83% already
[10:04:20] 2 pods left, doubtful we'll see it in this run :-(
[10:08:42] 100%, progressed to the mwdebug hosts and is now checking testservers
[10:08:56] I am continuing with the sync
[10:09:10] elukey: we didn't get lucky enough, I'd say
[10:11:33] one thing that I noticed in the logs for yesterday is this
[10:11:39] 2025/03/31 19:15:18 [error] 3845122#3845122: *38732 upstream prematurely closed connection while reading upstream, client: 10.64.0.212, server: , request: "GET /v2/restricted/mediawiki-multiversion-debug/blobs/sha256:128b91e8163d40642d2bdd410f8544bee05ee9cb6a28190d0eca8a79f5bd2e8c?ns=docker-registry.discovery.wmnet HTTP/1.1", upstream: "http://127.0.0.1:5000/[cut]
[10:11:49] and that ip is a wikikube worker
[10:11:56] there are just a few
[10:12:09] but the sha256 is the one that is reported last in the task
[10:12:20] is there something in the registry process's logs?
[10:12:27] I didn't look into that yesterday
[10:13:00] good point
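A sketch of the kind of check being suggested here: pulling the registry daemon's own logs around the failing digest. The systemd unit name docker-registry is an assumption inferred from the "docker-registry[...]" syslog tag in the journal line quoted just below; adjust host, unit name and time window as needed.

```bash
# Sketch: search the registry process's journal on a registry host for the
# digest that is failing on the nginx side (unit name assumed, see above).
sudo journalctl -u docker-registry \
    --since "2025-03-31 19:00" --until "2025-03-31 19:30" \
  | grep 128b91e8163d40642d2bdd410f8544bee05ee9cb6a28190d0eca8a79f5bd2e8c
```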
[10:15:00] there are things like
[10:15:01] Mar 31 19:13:00 registry2004 docker-registry[2179572]: time="2025-03-31T19:13:00.513616598Z" level=error msg="response completed with error" err.code="blob unknown" err.detail="sha256:128b91e8163d40642d2bdd410f8544bee05ee9cb6a28190d0eca8a79f5bd2e8c" err.message="blob unknown to registry"
[10:16:31] btw my /var/cache/nginx fs being full theory is debunked by the timing of https://phabricator.wikimedia.org/T390251#10695479
[10:16:45] hmmm
[10:16:52] what does this even mean?
[10:16:58] unknown to registry
[10:20:35] ah no wait, it is also referenced by Scott, it may be that the client checks if the layer is already there
[10:20:41] I see a HEAD right after, with a 404
[10:20:55] and after that, response completed
[10:21:03] so there seems to be nothing weird so far
[10:23:21] to recap - we haven't seen this issue before Friday for sure, and then it was reported in https://phabricator.wikimedia.org/T390251
[10:23:44] David had similar issues but only with scap/mediawiki deployments
[10:23:54] this doesn't happen, so far, with any other image/layer
[10:24:43] and it seems that we always have trouble with mediawiki-multiversion-debug
[10:25:43] ah no
[10:25:45] not that last part
[10:25:54] we had failures with the -web image too
[10:25:59] okok perfect
[10:26:42] the nginx cache + full partition was surely not a good thing, and maybe that is why we don't see it often?
[10:26:44] restricted/mediawiki-webserver:2025-03-31-072141-webserver was the one that was failing yesterday EMEA morning
[10:28:05] oncallers: I'm messing with routing for mobileapps/restbase atm - doing a slow rollout and everything is okay so far, but if there's mobileapps/pcs-related noise that's me
[10:29:50] akosiaris: I'll add a summary to the task, at this point we may have more luck if it re-happens with nginx debug logs
[10:29:55] I can't think of anything else
[10:30:14] elukey: I am doing so already as a comment
[10:30:19] claime: you too, please chime in if you have ideas
[10:30:24] okok perfect
[10:30:35] no clue
[10:38:31] akosiaris: I added one, though, namely that yesterday David tried multiple times to deploy and it consistently failed.. so maybe the corrupted cache played a role in exacerbating the problem
[10:38:41] now it pops up and goes away
[10:39:07] that is frustrating as well, but IIUC subsequent deployments are not stuck
[10:39:22] (to be confirmed, this is what I gathered from the SAL etc..)
[10:48:34] * elukey lunch
[13:02:07] akosiaris, claime - one thing that I didn't clear up yesterday was the auth cache in nginx
[13:02:26] so I am 80% convinced that it is not the issue, but the other 20% wants me to wipe the cache
[13:02:44] it should really be a small perf penalty for the first reqs after the restart
[13:03:43] basically /var/cache/nginx-auth/
[13:04:01] 14M, nothing big
[13:18:28] also, another deployment to mw-debug went through without issues
[13:20:53] everything in the auth cache seems to be related to KEY: jwt /v2/repos/abstract-wiki/wikifunctions/function-orchestrator/blobs/uploads/Bearer
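A sketch of how the auth cache contents mentioned above can be surveyed before deciding whether to wipe it: nginx proxy-cache files carry their cache key in a plaintext "KEY:" header line, which is where the jwt /v2/repos/... key above comes from. The path is the one from the conversation; the wipe is shown only as the option being weighed, not something that was run.

```bash
# Sketch: check the size of the nginx auth cache and summarise the keys it holds.
sudo du -sh /var/cache/nginx-auth/
sudo grep -rah '^KEY:' /var/cache/nginx-auth/ | sort | uniq -c | sort -rn | head

# If wiping it were the call (small perf penalty on the first requests only),
# it would roughly be: stop nginx, clear the directory, start nginx again.
# sudo systemctl stop nginx && sudo rm -rf /var/cache/nginx-auth/* && sudo systemctl start nginx
```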
[13:24:01] * Lucas_WMDE is apparently acting as an unwitting guinea pig for deployments? :D
[13:29:41] Lucas_WMDE: never :P
[13:30:14] jokes aside, we are seeing a sporadic issue when the wikikube workers pull new mediawiki-multiversion-{debug,web} images
[13:30:42] the pods fail to come up due to some docker image layers not passing the sha256 check
[13:30:49] (like they were corrupted)
[13:30:54] auto-resolves
[13:31:18] elukey: I have no idea how the auth cache would result in invalid digest blobs, but sure
[13:31:45] yes, me too, it is more thinking out loud than real proof
[13:32:40] elukey: yikes, good luck… :S
[13:32:57] (I also did a security deploy earlier btw, in case that's useful for counting successes)
[13:33:08] oh yes, it is useful, thanks :)
[13:33:32] I'll update the task about the auth cache idea
[13:41:45] for some weird reason, on 2005 the error log and debug log seem stuck, 2004 is good
[13:41:52] I tried to restart nginx on 2005 but nothing
[13:46:53] elukey: it would be super surprising, but sure.
[13:48:01] I've depooled 2005, not sure why nginx doesn't log to the error/debug log anymore
[13:49:00] what the..
[13:49:03] on 2004 it works fine
[13:58:16] after a reboot it works
[13:58:31] oooof, the root partition is full
[13:59:41] ok, the debug experiment is a no-go
[13:59:43] reverting
[14:11:23] ok done
[14:50:54] gitlab build_ci_deb question: how do I go about negating "Job execution will continue but no more output will be collected."
[14:51:00] I want to see the output
[14:52:32] (yes, I did search for it but I can't seem to find how to do this the right way™ in our context)
[15:01:37] I've not run into that
[15:08:12] I think you might have hit a limit on output. Maybe it's a little too verbose? :P
[15:08:21] yeah, this is building openssl
[15:08:29] it passed but I would love to have the full output there
[15:17:42] sukhe: probably a question to ask in #wikimedia-sre-collab as they own gitlab
[15:18:10] (if it turns out wmf-debci can tweak something to store Moar Output then I can do that, but I suspect this is a gitlab thing)
[15:24:03] Emperor: thanks, fair enough!
[15:38:23] I just reimaged/renamed elastic2055->cirrussearch2055 and the cookbook is stalling on `Nagios_host resource with title cirrussearch2055`...did I miss something in this PR? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1132772
[15:43:27] Using sre.hosts.rename?
[15:43:50] claime yeah, I ran the rename cookbook immediately before the reimage
[15:44:59] with --new?
[15:47:21] side comment: there are double pipes in that regex in site.pp: |089||090
[15:48:20] yeah, exact invocation was `sudo cookbook sre.hosts.reimage --os bullseye --use-http-for-dhcp --move-vlan --new cirrussearch2055 -t T388610 `...the host reimaged fine, just stuck on the nagios step
[15:48:21] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[15:49:43] mutante good catch, I can fix that shortly
[15:52:09] The nagios step (I think) could mean puppet doesn't apply cleanly and so doesn't export the resource
[15:53:44] quite plausible; in any case I've had all sorts of weirdness when I screwed up the regex, so I'd start with fixing that and see if there still is a problem after that
[15:53:46] yeah, your host isn't in puppetdb I think
[15:54:33] you can put the host in the "insetup" role in site.pp first.. then puppet should definitely work. then change the role in a second step
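On the stray pipes flagged at 15:47: a toy illustration (not the actual site.pp pattern) of what an empty alternative does. A double pipe, or a trailing pipe inside a group, lets the group match nothing, so a node regex can quietly match hostnames it was never meant to, depending on what surrounds the group.

```bash
# Toy example with made-up host numbers, not the real site.pp regex.
# The empty alternative created by "||" (or a trailing "|") lets the group
# match the empty string:
echo "cirrussearch.codfw.wmnet" | grep -cE '^cirrussearch(2055|2056|)\.codfw\.wmnet'   # prints 1
echo "cirrussearch.codfw.wmnet" | grep -cE '^cirrussearch(2055|2056)\.codfw\.wmnet'    # prints 0
```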
[16:03:15] OK, CR up for fixing the regex if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133174
[16:04:21] inflatador: https://puppetboard.wikimedia.org/report/cirrussearch2055.codfw.wmnet/ffbe49a111f55c56d828819c943f5753740a7acb
[16:05:01] it says Could not find class ::role::opensearch::cirrus
[16:05:03] there
[16:06:27] volans Good catch, let me fix that
[16:08:34] the regex does not have the || anymore. +1. still ends in |) but I guess that does nothing
[16:08:45] yea, cirrus::opensearch vs opensearch::cirrus
[16:18:11] mutante thanks again, I've got a patch to fix the role and remove that last pipe: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133177
[16:19:39] +1ed, lgtm!
[16:21:28] once again I am in your debt, good sir ;)
[16:23:46] OK, starting the reimage again, will check in ~45 or so
[16:25:52] inflatador: is the cookbook still polling?
[16:26:12] also, if not, you can resume it with --no-pxe
[17:16:44] back..thanks for the advice again. Looks like we stalled out at `Nagios_host resource with title cirrussearch2055 not found yet`. Will check puppet to see what I missed
[17:18:33] https://puppetboard.wikimedia.org/report/cirrussearch2055.codfw.wmnet/f7bf6fb08efb9ccbed8a579af1432c94487d5a1b
[17:19:27] ACK, looks like we're not including the LVS config properly
[17:50:49] sukhe: not sure if you know, but the ripe atlas now seems to run permanent measurements towards the wikipedia "wellknown" target: https://atlas.ripe.net/probes/7508/results/wellknown
[17:51:01] that's from our own anchor, thus the tiny latency
[17:54:14] XioNoX: definitely did not know that, thanks for sharing!
[17:54:28] ChrisDobbins901_: ^ might be handy for the geodns pipelining work as well
[17:55:27] Thanks, XioNoX and sukhe!
[18:30:40] getting some more failures related to puppet version: ` File "/srv/deployment/spicerack/cookbooks/sre/hosts/reimage.py", line 263, in _get_puppet_server
[18:30:40] has_puppet7 = self.puppet_server.hiera_lookup(self.fqdn, "profile::puppet::agent::force_puppet7")`
[18:32:01] This is a brand-new role, do I need to tell it to use Puppet 7 somewhere? I don't recall this being an issue with the cloudelastic and relforge roles (also brand-new), but maybe I missed something
[18:32:30] I tried the `puppet.migrate-role` cookbook and it said: `No hosts found matching {self.role} still running puppet5`
[18:32:56] inflatador: in the relevant hiera, you need to set that to true or false
[18:33:25] like in hieradata/role/common/cirrus/cloudelastic.yaml, you have profile::puppet::agent::force_puppet7: true
[18:34:39] sukhe thanks, it sounds like I have the wrong name for my hiera file then, similar to the role mistake I made earlier
[18:34:55] hieradata/role/common/cirrus/cirrus.yaml should be hieradata/role/common/cirrus/opensearch.yaml (this file does have force puppet 7)
[18:43:04] inflatador: what I usually do before doing things like these, to verify that all is groovy, is to check that puppet compiles the catalog successfully.
[18:43:23] sudo puppet lookup --compile --node "authdns_servers" on puppetserver1001 for example
[18:43:33] you can replace authdns_servers with a key of your choice
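Spelled out for this particular case, a hypothetical invocation on puppetserver1001 might look like the following; the node FQDN and the key come from the discussion above, and this is just one plausible way to write it (puppet lookup takes the key as a positional argument alongside --node, --compile and --explain).

```bash
# Hypothetical example: check how the puppet7 flag resolves for the renamed
# host before kicking off another reimage (FQDN assumed from the rename above).
sudo puppet lookup --compile --explain \
    --node cirrussearch2055.codfw.wmnet \
    profile::puppet::agent::force_puppet7
```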
[18:43:48] sukhe ACK...it's tough because this is a brand-new role and PCC has no facts for the host
[18:43:50] (if you pass --explain, it will explain how it looks up the key too)
[18:43:59] sukhe: looks like it's working as expected: https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1&from=now-3h https://grafana.wikimedia.org/d/lU6QJQJnk/atlas-worldmap
[18:44:01] inflatador: yeah, those are the difficult ones, you have my sympathies!
[18:44:24] sukhe Thanks. Y'all have been very helpful so no worries there
[18:44:25] XioNoX: very nice! thanks!
[18:44:47] this is the first host in a major migration, so we expected a few bumps ;)