[00:17:06] TimStarling: I'm looking at the number of replicas we have for different kinds of data (a tangent from parsercache and its RAID config, and how that fits into the future ideas/plans for storing HTML of all pages indefinitely instead of under a cache TTL, in terms of growth scaling). Reminded me to look at external store.
[00:17:28] It seems we have 2 intra-dc replicas of the es* dbs. And given it's append-only, that basically has to keep growing.
[00:18:06] I vaguely recall code existing for compression of similar/nearby blobs, and you saying a few years ago that we never actually ended up using it much in prod.
[00:18:38] yes, compression of about 90% is achievable
[00:18:40] Where did that story end up? We did a lot of consolidating, I believe, but bottom line - are we or are we not compressing blobs together overall?
[00:19:09] the problem is that it required someone to run a maintenance script to effect the compression, rather than it just happening automatically
[00:19:33] so when I stopped caring about it, there was some bitrot and nobody could be bothered making it work again
[00:20:16] the bitrot bug is https://phabricator.wikimedia.org/T106388
[00:22:24] there is a second minor bitrot issue -- by far the best compression algorithm I found was to compute differences between adjacent revisions, and then to compress the patches
[00:23:23] I used a PECL extension which provides very compact binary diffs, but the extension is now unmaintained
[00:23:30] right, diff rather than concatenate+gzip
[00:24:13] the problem with gzip is that its dictionary size is only 32KB, so concatenate+gzip fails badly as soon as the article is larger than 32KB
[00:24:29] compression with LZMA or something else with a larger dictionary size would work
[00:24:56] but the binary diff thing is also quite nice and is already implemented -- probably someone just has to update that PECL extension for PHP 7
[00:26:33] I'm trying to find in the release notes history what we did those few years ago when all this came up. I recall a lot of code being removed and some things being simplified, but I can't find it now.
[00:27:05] but if I had more time, I would want to implement some sort of automatic compression which happens as blobs are written
[00:27:45] the idea of running an epic maintenance script every year or two has clearly failed as a process
[00:28:34] if it can be trusted to run unattended, and is graceful enough to run live, it could be a maintenance script run at most once a month automatically.
[00:28:53] does that seem realistic?
[00:29:08] no, this is the problem, it can't run every month
[00:29:35] it was written at a time when (like now) we had a huge backlog of uncompressed storage
[00:30:11] right, so the "initial" run would take more than a month, and then after that we'd have forgotten about it or something like that.
[00:30:17] and once you have that, there's no point just deleting rows, you need to release disk space to the OS, maybe even reduce the server count
[00:30:38] the way the compression script works is to recompress a set of clusters into a new cluster
[00:30:51] then you just put the old servers in the bin
[00:31:06] I feel like it should be possible not to have to revisit all compressed blobs, though, but I imagine you have that part already figured out. e.g. something like leaving the last (few) revision(s) uncompressed so as to give you a way to continue.
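A rough Python sketch of the trade-off described above, for illustration only: concatenate+gzip can only exploit redundancy within gzip's 32 KB window, whereas diffing adjacent revisions first captures it regardless of article size. This is not the MediaWiki code; difflib stands in for the unnamed binary-diff PECL extension, and the synthetic revision data is made up.

```python
# Illustrative only: compare "concatenate + gzip" with "diff adjacent revisions,
# then compress the diffs" for an article larger than gzip's 32 KB window.
import difflib
import random
import zlib


def concat_gzip_size(revisions):
    """Size of all revisions concatenated and compressed as one zlib stream.
    zlib/gzip can only reference matches within a 32 KB window, so redundancy
    between adjacent copies of a >32 KB article is out of reach."""
    return len(zlib.compress("".join(revisions).encode("utf-8"), 9))


def diff_then_compress_size(revisions):
    """Store the first revision in full, then one compressed diff per edit."""
    total = len(zlib.compress(revisions[0].encode("utf-8"), 9))
    for prev, cur in zip(revisions, revisions[1:]):
        patch = "".join(difflib.unified_diff(
            prev.splitlines(keepends=True), cur.splitlines(keepends=True)))
        total += len(zlib.compress(patch.encode("utf-8"), 9))
    return total


if __name__ == "__main__":
    # A ~120 KB pseudo-random "article" (well past the 32 KB window), edited 20 times.
    random.seed(0)
    words = ["alpha", "beta", "gamma", "delta", "wiki", "text", "blob", "store"]
    base = "".join(
        " ".join(random.choice(words) for _ in range(10)) + "\n"
        for _ in range(2000))
    revisions = [base + f"edit number {i}\n" * 3 for i in range(20)]
    print("concat + gzip:       ", concat_gzip_size(revisions))
    print("diff, then compress: ", diff_then_compress_size(revisions))
```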
[00:31:49] right, the second reason you can't run it every month is that the compression ratio improves the longer you leave it
[00:32:02] it's like it wants us to procrastinate
[00:32:24] because the script doesn't try to pick apart old compressed blobs and append to them
[00:33:13] right, but we could write it such that it only compresses blobs together if it's unlikely a better opportunity comes along. e.g. only compress "full" blocks, and only if they don't include the last revision of a page, or something like that. and then regularly compress opportunities as they become available. that seems like something that could run monthly within the available space without needing to recompact/recompress.
[00:33:25] but I see the bootstrapping issue now.
[00:34:11] recompressTracked.php is imagined as an epic, approximately annual process in which you release disk space to the OS by recompressing all data and then deleting the source tables
[00:35:08] the reason for T106388 is source table deletion -- if we can't find all the references to all the blobs in a cluster, we can't change the pointers
[00:35:09] T106388: Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388
[00:36:29] someone suggested to me in like 2005 that we should just store reverse diffs
[00:36:59] i.e. every time you update an article, compute a diff from the current version to the previous version
[00:37:21] then replace the previous version with the diff, and insert the current version as the full text
[00:37:24] right, with perhaps a keyframe every N revisions. not unlike videos.
[00:37:49] making the diffs go backwards means that access to more recent revisions is more efficient
[00:38:49] ah, I see. latest would be full text. interesting.
[00:42:35] some of our cache hit data has been used as a public dataset to benchmark different approaches and measure different objects/qualities. If not already, I suppose we could make something similar with a chunk of our external data, to e.g. compare different approaches on queries needed to access the latest rev vs any old rev, redundancy, storage size, operational complexity (strictly append-only vs append and compress later vs recompress everything).
[00:44:30] TimStarling: so it seems the 2015-2017 work in this area was mainly consolidating maintenance scripts, fixing some DiffHistoryBlob-related bugs, and writing up the task tree of T106388. I also remember us developing or changing how HistoryBlob works at runtime, or dropping support for some of the formats.
[00:44:31] T106388: Audit all existing code to ensure that any extension currently or previously adding blobs to ExternalStore has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388
[00:44:36] But I can't find that now..
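A minimal sketch of the reverse-diff idea discussed above: the latest revision is always stored as full text, each older revision as a diff computed from its newer neighbour, and every Nth revision is kept as a full-text keyframe so reads never have to walk back too far. This is a toy in-memory illustration, not the MediaWiki HistoryBlob code; the opcode-based patch format is invented for the example, standing in for a real binary diff.

```python
# Toy reverse-diff store: latest revision is full text, older revisions are
# reverse diffs applied backwards, with a full-text keyframe every N revisions.
import difflib


def make_patch(src, dst):
    """Ops that rebuild dst from src (a reverse diff when src is the newer text)."""
    matcher = difflib.SequenceMatcher(a=src, b=dst, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse src[i1:i2]
        elif j1 != j2:
            ops.append(("insert", dst[j1:j2]))  # literal text only present in dst
    return ops


def apply_patch(src, ops):
    return "".join(src[op[1]:op[2]] if op[0] == "copy" else op[1] for op in ops)


class ReverseDiffStore:
    def __init__(self, keyframe_interval=20):
        self.n = keyframe_interval
        self.slots = []  # one entry per revision: ("full", text) or ("diff", ops)

    def append(self, text):
        if self.slots:
            last = len(self.slots) - 1
            if last % self.n != 0:  # keyframes (every Nth revision) stay full text
                _, old_text = self.slots[last]
                # Replace the previously-latest full text with a reverse diff
                # that rebuilds it from the incoming (newer) revision.
                self.slots[last] = ("diff", make_patch(text, old_text))
        self.slots.append(("full", text))  # the latest revision is always full text

    def get(self, idx):
        # Find the nearest newer full-text revision, then walk backwards,
        # applying one reverse diff per step.
        j = idx
        while self.slots[j][0] != "full":
            j += 1
        text = self.slots[j][1]
        while j > idx:
            j -= 1
            text = apply_patch(text, self.slots[j][1])
        return text


if __name__ == "__main__":
    store = ReverseDiffStore(keyframe_interval=5)
    for i in range(12):
        store.append("some article text\n" * 50 + f"edit {i}\n")
    assert store.get(11) == "some article text\n" * 50 + "edit 11\n"  # latest: no patching
    assert store.get(3) == "some article text\n" * 50 + "edit 3\n"    # old rev via reverse diffs
```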
[00:45:05] I remember now that the reverse diff suggestion was made by someone who watched the talk I gave at 21C3 in December 2004 -- I don't know the person who said it, but I think he said "that's how we do it in Subversion"
[00:46:29] at 21C3 Brion talked about the whole system and Wikipedia in general, and I talked about text storage and compression
[00:48:13] maybe the thing to do is to implement some kind of incremental scheme first, and then to run recompressTracked.php one last time
[19:12:39] tchin: welcome :) - I see you're getting familiar with the Gerrit buttons for "Ack" etc; it took me a while to figure out why one would use that as well. To me the non-obvious realization was that Gerrit allows comments to be marked as resolved, and by extension lets you see at a glance from your dashboard which of your patches still have unresolved comments outstanding.
[19:14:06] where Ack is a shortcut for marking it resolved + saying "Ack" (instead of just silently resolving/dismissing).
[19:23:13] Also, when you manually "Submit" a patch (in repos where jenkins doesn't do it for you), Gerrit will warn you if there are still unresolved comments.