[08:26:48] Emperor: o/
[08:27:28] thanks for the follow-up in the docker/swift task - I am very ignorant about the replication part, could you please add some more info about how it is usually done?
[08:28:15] namely, what does swift expect from the service to make replication work
[08:36:26] elukey: everything - in ms, the two clusters are entirely separate, it's 100% on the user/operator/application to copy things between the two clusters
[08:36:37] codfw ms knows nothing of eqiad ms
[08:37:28] to put it another way, if there was replication going on, it wasn't swift doing it
[08:43:28] okok that is clear, I am not understanding what the Sync To value listed by swift stat means though
[08:44:11] like Sync To: //docker_registry/eqiad/AUTH_docker/docker_registry_codfw
[08:49:48] I don't know is the short answer
[08:51:32] okok super, I didn't mean to point the finger at you, I am just trying to brainbounce since I have no idea either :D
[08:52:04] huh, TIL we have some container-sync-realms.conf set up!
[08:54:03] T214289 is entirely news to me (from 2019)
[08:54:04] T214289: Make swift containers for docker registry cross replicated. - https://phabricator.wikimedia.org/T214289
[08:56:30] elukey: this is my sad face
[08:58:33] :D
[08:58:48] need to commute to the office, will be back in a bit
[08:59:09] https://docs.openstack.org/swift/latest/overview_container_sync.html are the upstream docs
[09:19:28] ok so that explains what I am seeing in swift stat, nice
[09:19:39] in the end I'll add some documentation :D
[09:20:40] we should stop doing this, it violates a whole pile of my assumptions about our various object stores. But that doesn't help right now
[09:21:23] Emperor: the plan is to move the registry away from swift, it will take a bit but hopefully in Q3/Q4 we'll be able to do it
[09:21:48] so if you don't mind keeping it for the time being, it will hopefully go away
[09:21:59] I don't even know if I can fix it right now
[09:22:47] we can probably start figuring out if there is any trace of logs or similar indicating what's wrong, and then decide how bad it is
[09:23:08] it started to break on a specific date, so we may also check what was done in swift at the time
[09:23:18] maybe a specific thing/refactor/etc.. triggered it
[09:23:19] I'm starting by trying to find the logs
[09:23:27] I'll try to help as well
[09:24:09] looks like 26 servers have at least one related log entry
[09:25:44] no, those are all commons objects with sync in the name
[09:27:53] That's not good. Pick a random backend, run systemctl status swift-container-sync.service, and you get: 'Assert: start assertion failed at Thu 2025-10-23 10:28:20 UTC; 1 months 25 days ago'
[09:29:13] OK, but it's running elsewhere (and those asserts look like a setup race with ring deployment)
[09:29:36] ah ok so there is a daemon running
[09:30:39] looking at swift-container-sync(1) it should be the host(s) where the container db is located that actually do the sync, so the next step is to find those
[09:31:57] and since it's presumably replication _from_ codfw that's borked, let's try there.
[09:32:27] the documentation also lists INFO as the default log level, we may try flipping that to debug on a test host to see if it leads to more data
[09:33:48] maybe ms-be2068.codfw.wmnet ?
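The container sync mechanics touched on above (the Sync To value, the shared realm config, the upstream docs) come down to a couple of pieces; the following is a minimal sketch based on the upstream overview_container_sync page. The realm/cluster layout and the keys are placeholders, not the real production values.

    # Inspect what (if any) sync relationship a container already has:
    swift stat docker_registry_codfw | grep -E 'Sync To|Sync Key'

    # Both clusters share a realm definition; /etc/swift/container-sync-realms.conf
    # has roughly this shape (values are placeholders):
    #   [REALM_NAME]
    #   key = REALM_KEY_PLACEHOLDER
    #   cluster_CLUSTER_NAME = https://proxy.example/v1/

    # Pointing a container at its peer is then a metadata update on it:
    swift post SOURCE_CONTAINER \
        -t '//REALM_NAME/CLUSTER_NAME/AUTH_account/DEST_CONTAINER' \
        -k 'PER_CONTAINER_SYNC_KEY_PLACEHOLDER'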
[09:34:04] We need to find which 3 nodes are meant to be doing the sync first
[09:34:10] ms-be2081.codfw.wmnet
[09:34:17] this one has errors
[09:35:16] (#012Exception: Unknown exception trying to GET: 'AUTH_docker'
[09:35:31] also ms-be1066.eqiad.wmnet has docker-related logs
[09:35:42] same for ms-be1076.eqiad.wmnet
[09:35:50] and ms-be1087.eqiad.wmnet
[09:36:31] !log restart swift-container-sync on ms-be2081 T413008
[09:36:32] Emperor: Not expecting to hear !log here
[09:36:32] T413008: Cross-datacenter Docker Registry replication broken since 2025-04-27 - https://phabricator.wikimedia.org/T413008
[09:37:40] so on restart it complains about not being able to talk to a local memcached and then immediately can't sync
[09:39:14] should there be a memcached server? I don't see any
[09:39:32] https://phabricator.wikimedia.org/P86729
[09:39:40] I think that's likely a red herring
[09:39:46] (memcached, that is)
[09:40:39] yeah that is more promising, what do you think about lowering the logging threshold to, say, debug?
[09:40:46] on 2081
[09:41:03] No, we should use what we've got first.
[09:41:23] So, taking that transaction id tx6502691d5bc343ebbd651-006943cb1f and searching for it
[09:41:45] OK, that's useful
[09:42:23] that transaction ID is a bunch of GETs on something that says 404
[09:42:26] e.g.
[09:42:27] I proposed that because the "unknown exception" is not very telling
[09:42:31] Dec 18 09:36:31 ms-be2066 object-server: 10.192.9.13 - - [18/Dec/2025:09:36:31 +0000] "GET /sdg1/9958/AUTH_docker/docker_registry_codfw/files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267" 404 70 "GET
[09:42:32] http://localhost/v1/AUTH_docker/docker_registry_codfw/files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267?symlink=get" "tx6502691d5bc343ebbd651-006943cb1f" "proxy-server 809088" 0.0009 "-" 2010 0
[09:44:10] So, it seems that the container sync server is trying to sync an object that doesn't exist
[09:46:30] FYI, the exception seems to be raised by /usr/lib/python3/dist-packages/swift/container/sync.py:623
[09:46:43] there are timestamp comparisons in there
[09:47:07] I'm concerned by the "symlink" thing there because we don't (shouldn't) have symlink support in this cluster, and we don't in proxy-server.conf but _do_ have it in /etc/swift/internal-client.conf (which seems wrong, but I think irrelevant here)
[09:48:33] and the internal-client.conf hasn't changed since at least 2023, so I think that thread doesn't need pulling on
[09:48:46] Emperor: I think it is used by the sync.py code
[09:49:08] try:
[09:49:08] source_obj_status, headers, body = \
[09:49:08] self.swift.get_object(info['account'],
[09:49:08] info['container'], row['name'],
[09:49:08] headers=headers_out,
[09:49:09] acceptable_statuses=(2, 4),
[09:49:09] params={'symlink': 'get'})
[09:51:23] Again, I don't think it should be relevant, since I think that without the proxy-servers being told to enable the symlink middleware we shouldn't have symlinks at all
[09:52:36] sure, but I was replying to your question about why it is used
[09:52:41] and the related concerns
[09:53:09] from the code, IIUC, a faulty replication should stop the whole process
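A rough sketch of the log digging described above, on one of the codfw backends; the exact message text may differ slightly from what journalctl actually prints, and the transaction id is the one quoted in the chat.

    # Pull the container-sync failures out of the journal on a backend:
    sudo journalctl -u swift-container-sync --since today \
        | grep 'Unknown exception trying to GET'

    # Chase a single transaction id across everything logged on the host:
    sudo journalctl --since today | grep 'tx6502691d5bc343ebbd651-006943cb1f'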
[09:53:15] if I try swift stat docker_registry_codfw files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267
[09:53:16] but I only skimmed it quickly
[09:53:17] I get 404
[09:54:12] So assuming the container db on ms-be2081 has an entry for that object (which will need a bit of faffing to check), this is the underlying issue - that the container doesn't contain what the container db thinks it does.
[09:58:06] I see other errors unrelated to this one, is this something more widespread?
[09:58:30] namely, is that binary the only problem, or are we looking at more?
[10:02:43] I see most of the issues for /srv/swift-storage/accounts1/containers/19492/375/4c244b8ad285838b1df91b5193ff8375/4c244b8ad285838b1df91b5193ff8375.db so maybe not
[10:09:37] that will be the relevant container db
[10:10:19] anyhow, I can do: swift list docker_registry_codfw --prefix files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/
[10:10:22] and get
[10:10:45] back files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267
[10:11:23] so: we have at least 1 (probably more, but) object which is listed in the container db as extant but does not actually exist. So we are not going to space today
[10:21:53] Dec 18 08:51:02 ms-be2081 container-sync: Since Thu Dec 18 07:50:57 2025: 12 synced [0 deletes, 0 puts], 0 skipped, 1583 failed (txn: tx71bc061fffed4b3b87059-006943c06f)
[10:22:14] the number of "failed" varies a bit, but it's always over 1500
[10:24:29] (I don't know if that reflects the number of bad objects in the container)
[10:25:23] I am not sure what the problem is though, since the docker registry's swift storage driver doesn't clean up binary objects when we remove images from the manifest
[10:25:42] so I can't think of a workflow where swift codfw gets a binary and then it disappears
[10:30:32] to hazard a guess, ms swift was being DoS'd and something went wrong
[10:30:48] we've found similar oddities in the commons containers occasionally
[10:33:09] is there any procedure to try to clean up the broken db?
[10:33:56] like we back up the file, open it and try to, say, drop row 4401258
[10:34:08] restart the swift sync and see if it moves forward
[10:34:27] not pretty I know, but I am not sure if there is an alternative way to clean up
[10:39:45] what we did when we had ghost objects in commons was (IIRC) attempt to DELETE the offending object(s)
[10:40:51] [with a bunch more logic around how the two clusters work for commons]
[10:41:56] [cf T327253 ]
[10:41:57] T327253: >=27k objects listed in swift containers but not extant - https://phabricator.wikimedia.org/T327253
[10:43:02] obviously if the object _does_ exist in eqiad then that delete might get propagated; and I don't know if the 3 container dbs are consistent
[10:44:52] ok so let me recap to see if I got it - the object leads to a 404 in codfw, but if we try to delete it the operation will be propagated to eqiad, and we may end up with more (or full) consistency in the end
[10:45:01] before even attempting to mess with the dbs
[10:45:08] if so, let's absolutely do it
[10:45:17] if it is not in codfw we don't care
[10:48:27] I'm overdue an RSI break (never mind all the urgent-before-year-end things I really had to do today), so I'm going to take one shortly. We can go with "delete ghost objects", but we might want to e.g. see if the 3 container dbs agree and/or how many objects are thus bad, all of which will take time in a container with 2M objects in it :(
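A compact sketch of the check being discussed, i.e. confirming that an object is a "ghost" (present in the container listing but 404 on a direct stat); the path is the one from the logs above.

    # Ghost check: the container lists the name, but the object itself is gone.
    obj='files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267'
    swift list docker_registry_codfw --prefix "$obj"   # shows the name...
    swift stat docker_registry_codfw "$obj"            # ...but this returns 404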
[10:58:38] I am happy to do the work if you give me a little guidance about what you mean by deleting the object
[11:08:02] we'll need a list, and then something like
[11:08:54] if swift stat docker_registry_codfw objectname 2>&1 | grep -q '404 Not Found' ; then swift delete docker_registry_codfw objectname ; fi
[11:13:10] We can find the container dbs with swift-get-nodes /etc/swift/container.ring.gz AUTH_docker docker_registry_codfw
[11:14:57] which tells us /srv/swift-storage/accounts1/containers/19492/375/4c244b8ad285838b1df91b5193ff8375 on ms-be2089, /srv/swift-storage/accounts1/containers/19492/375/4c244b8ad285838b1df91b5193ff8375 on ms-be2081 and /srv/swift-storage//accounts0/containers/19492/375/4c244b8ad285838b1df91b5193ff8375 on ms-be2083
[11:18:36] We can have a quick check on those with sudo sqlite3 --readonly [path] "SELECT COUNT(*) FROM object WHERE deleted == 0"
[11:21:44] OK, two say 1963748, one says 1963747
[11:33:58] Having a go at producing a list of sad objects
[11:35:52] can we start with docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267 ?
[11:36:32] try to delete it; if it doesn't work (as I suspect) we may need to just drop the row from ms-be2083's db
[11:38:17] You want to delete that and then observe that it no longer appears in swift list output?
[11:38:54] because you think it still will?
[11:41:19] (those aren't rhetorical questions, I'm just checking I understand what you'd like us to try and why)
[11:45:20] I was trying to follow your line of thought and see if cleaning up that specific bit would lead to a different status of the container sync
[11:45:31] because maybe the issue is just that one inconsistency
[11:45:43] and if it disappears, and another one pops up, we know that the procedure makes sense
[11:45:55] rinse and repeat until we reach the fixed point
[11:48:06] straightforward enough test to do, I guess.
[11:51:02] elukey: you mean files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267 right? [note the leading files/ compared with yours]
[11:51:31] yeah
[11:52:54] on ms-be2081 if I filter logs for /srv/swift-storage/accounts1/containers/19492/375/4c244b8ad285838b1df91b5193ff8375/4c244b8ad285838b1df91b5193ff8375.db I see more than one problematic row id
[11:52:55] sigh
[11:53:43] Hm, this has worked before (cf the remove-ghost-objects cookbook)
[11:54:52] but swift delete docker_registry_codfw files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267 says 404 (as one might expect)
[11:55:11] Ah, but now indeed
[11:55:24] swift list docker_registry_codfw --prefix files/docker/registry/v2/repositories/restricted/mediawiki-multiversion-cli/_uploads/0ac43c8d-5203-4cbb-9b28-a9a3cec7d60f/hashstates/sha256/453267 returns nothing
[11:55:37] (it didn't initially, but I guess that's eventual consistency for you)
[11:55:53] TIL about the cookbook, should we use it?
[11:56:08] No, it's solving a similar but adjacent problem
[11:56:39] lovely
[11:56:52] My little script that does (for obj in $(swift list) ...) is still going
[11:57:01] but it's obviously going to be a while.
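Spelled out, the "delete iff 404" loop from 11:08 would look something like the sketch below; ghost-objects.txt is a hypothetical file holding one candidate object name per line (e.g. the list extracted from the container-sync journal).

    # Delete an object only if a direct stat on it returns 404 (i.e. it is a ghost).
    while IFS= read -r obj; do
        if swift stat docker_registry_codfw "$obj" 2>&1 | grep -q '404 Not Found'; then
            swift delete docker_registry_codfw "$obj"
        fi
    done < ghost-objects.txt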
[11:57:33] I suppose the other thing to do would be to extract object names from the container-sync output
[11:58:27] which might be quicker than waiting for 2 million swift stat calls
[12:01:14] I have it, for the aforementioned db (at least according to journalctl)
[12:01:57] Emperor: https://phabricator.wikimedia.org/T413008#11472466
[12:06:47] right, but at least some of those objects _do_ exist
[12:07:01] e.g. swift stat docker_registry_codfw files/docker/registry/v2/repositories/wmf-debci-bookworm/_manifests/tags/0.0.3-20250427/index/sha256/6e10cf463213ac8f19375b241445041880af9da656bef72ab1b0f982d55cfb2a/link
[12:07:41] * elukey nods
[12:08:15] I have to step afk for lunch, we can proceed later on if you have time, otherwise I'll carry on myself
[12:09:43] that object got a timeout at 06:51 this morning which I don't think we need to worry about
[12:10:08] okok
[12:11:07] I think if we want to try and identify suspect objects from the journal we should only consider those where there was an 'Unknown exception trying to GET'
[12:11:19] right, lemme refine
[12:11:44] (or let my slow script complete)
[12:13:14] updated
[12:13:25] can you post your script in the task for tracking purposes?
[12:13:30] for future selves
[12:14:51] Emperor: if you are ok with it I'll step afk, we can probably resync in a couple of hours? or more, not sure when you usually have lunch
[12:17:26] yeah, I'm going for lunch now, biab
[13:25:51] elukey: I've updated the task, but: I've found 134 sad objects from ms-be2081's container-sync journal, suggest we try a "delete iff 404" loop on them (and then wait a bit)
[14:10:48] Emperor: +1, should I do it or do you prefer to?
[14:12:38] elukey: don't mind, what would you rather?
[14:13:02] Emperor: I can try
[14:15:10] OK, if you get stuck or it's too annoying, LMK, I've done this sort of thing a few times
[14:21:46] if I try `swift stat files/docker/registry/v2...` I get `WARNING: / in container name; you might have meant 'files docker/registry/v2...` is there a gotcha that I should follow?
[14:29:40] that's not how you drive swift stat/delete/etc.
[14:29:48] It's swift stat containername objectname
[14:30:37] (cf https://phabricator.wikimedia.org/T413008#11472671 )
[14:31:07] in this case you meant, I think, swift stat docker_registry_codfw files/docker/registry/v2...
[14:56:39] Has anyone here done anything like an open directory backed by a swift container?
[14:56:55] I'm imagining something like Apache's directory listings: a rendered HTML list of the objects that lets users download each object.
[14:57:32] You can do that with S3 buckets quite easily
[14:59:16] That sounds promising. Does the persistence team offer S3 buckets?
[15:02:07] The artifacts are currently in swift, but I'm not opposed to migrating them.
[15:06:23] The "apus" Ceph cluster supports S3, as does the "thanos" swift cluster. But I think we'd want to know a bit more about what you're doing & why to be sure that's the correct approach
[15:10:16] elukey: how are you getting on?
[15:11:10] Currently, flame graph logs are hosted from one host per dc. These log files are pretty large, so we'd like to get them off the local disks to make room for processing. They're uploaded to swift, but not served publicly from there. cf. https://performance.wikimedia.org/arclamp/logs/daily/
[15:14:45] Emperor: sorry, I got other pings; thanks for the clarification, I was clearly missing the container name. Retrying now
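On cwhite's directory-listing question above: besides the S3 route Emperor mentions, Swift itself has a staticweb middleware that can render exactly this kind of HTML index. Nothing in this conversation confirms that middleware is enabled on these clusters, so the following is only a sketch of the general approach, with a hypothetical container name.

    # Sketch only: needs the staticweb middleware in the proxy pipeline.
    swift post arclamp-logs -r '.r:*,.rlistings'      # anonymous read + listing rights
    swift post arclamp-logs -m 'web-listings: true'   # ask staticweb to render an index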
[15:15:30] cwhite: sorry to do the "punt" thing, but can you open a ticket and we'll try and think about it in the new year? I'm afraid I already have way more urgent-before-year-end tasks than I stand any chance of getting done :(
[15:25:00] Emperor: I've restarted container sync on ms-be2081, so far I don't see horrors being logged
[15:25:25] ah no, one was logged, sigh
[15:26:14] related to files/docker/registry/v2/repositories/amd-gpu-tester/_uploads/f0f238f5-4ac9-4b33-97d3-02343642e0c4/hashstates/sha256/0
[15:26:15] mmm
[15:26:43] That's the first in our previous list, maybe we need to wait longer
[15:26:44] but I see some HTTP 200s etc..
[15:26:54] (I picked a different one and it's definitely not in listings any more)
[15:27:12] Emperor: no worries! I'll do that. Hope things calm a bit for ya soon :)
[15:27:54] I also see files/docker/registry/v2/repositories/mediawiki-httpd/_uploads/f527d720-0e7a-4829-a457-02473b8c86d6/startedat that wasn't in the list
[15:27:59] so maybe other bits will pop up
[15:28:57] but now the logs are mostly ok
[15:29:00] at least on that node
[15:29:22] At this point let it cook for a bit and see if the (e.g. hourly) summaries look better?
[15:29:46] Dec 18 15:29:30 ms-be2081 container-sync: Container sync report: AUTH_docker/docker_registry_codfw, time window start: 1766071470.665313, time window end: 1766071770.8357816, puts: 221, posts: 0, deletes: 0, bytes: 878662631, sync_point1: 4409700, sync_point2: 4405742, total_rows: 5816852 (txn: tx32d989e111094d098adb5-0069441dda)
[15:31:05] going to try to delete the few objects that it complained about
[15:33:01] ok done
[15:34:00] let's wait and see how it goes
[15:43:27] swift stat on the container from ms-fe1009 gives Objects: 1514864, which is already a higher number
[15:43:35] I'll add a summary to the task
[15:46:14] and done
[15:46:21] thanks Emperor :)
[15:47:34] ah, good, hopefully this will sort itself out given some time, then.
[16:29:26] Dec 18 16:24:39 ms-be2081 container-sync: Since Thu Dec 18 15:24:26 2025: 12 synced [0 deletes, 2890 puts], 0 skipped, 550 failed (txn: txde00e43295194486b58d0-0069442ac7)
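To keep an eye on whether the sync keeps catching up after the ghost deletions, something like the following would do, based on the report lines quoted above (the exact report wording may vary by Swift version; the container name is the one used throughout the chat).

    # Follow the periodic container-sync reports on the backend doing the work:
    sudo journalctl -u swift-container-sync -f | grep --line-buffered 'Container sync report'

    # And watch the object count on the receiving side's container grow:
    swift stat docker_registry_codfw | grep -E 'Objects|Bytes'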