
Determine how to update old compressed ExternalStore entries for T181555
Closed, Resolved, Public

Description

Per T181555: Remove use of PHP serialization in revision storage, we want to replace the existing ExternalStore entries that are using HistoryBlob objects with different "objects" that don't require the use of PHP's unserialize() method.

On WMF sites, the ES clusters with these objects are on databases that are running read-only since we normally never have any need to change them. But that's a problem for this non-normal situation.

I see three ways to handle this:

  1. Make the old ES databases writable temporarily so the existing rows can be updated by the migration script.
  2. Write all the replaced objects to a new ES cluster.
    1. Write them to one of the existing active clusters.
    2. Create a new cluster just for this migration.
  3. (added after discussion) Run recompressTracked.php over the existing ES clusters, including the ones that are already compressed.

The tradeoff here is storage space versus whatever difficulty is involved in making these clusters temporarily writable. The code I've written for T181555 will work with any of these options (well, not yet with option 3, but that'd be easy to account for).

Event Timeline

Anomie triaged this task as Medium priority. Dec 20 2017, 8:44 PM
Anomie created this task.

I would definitely suggest using a new cluster (active or separate; probably a separate one with 1:1 duplication would allow us to verify there is no data loss) and keeping the current data read-only/untouched. Note that clusters are nowadays fully logical entities and can physically be on the same machines as others, but I would like to keep ES append-only (and we can delete full clusters later). We would need to estimate how much extra disk we need, though.

Note that reconstructing the ES servers is a huge project (we are talking about rewriting 20TB of data after compression), and that the last project to rewrite it (T106363) was blocked on having proper backups (for which we do not have space yet). I would say this would require the same kind of resources for any of the options, and we probably won't get them until the 2018-2019 quarters.

If we rewrite all, or most of the objects, as it is such a long process, I would personally like to evaluate the usage and implementation of more modern compression at the same time (e.g. let the storage automatically compress revisions from the same page), for example RocksDB, by Facebook, and simplify what MediaWiki does with it (compression, consolidation).

Note that clusters are nowadays fully logical entities and can physically be on the same machines as others

ExternalStore "clusters" don't even have to be in separate databases, each one is just a single table. Currently rc1, cluster1, cluster2, cluster3, cluster4, cluster5, cluster6, cluster7, cluster8, cluster9, cluster10, cluster20, cluster21, cluster22, and cluster23 are all just differently-named tables in the same MySQL database on the same machines. Only cluster24 and cluster25 are on different MySQL instances.

We would need to estimate how much extra disk we need, though.

[...]

If we rewrite all, or most of the objects

As far as I know, no code that generates objects has been run for many years. A quick check seems to support this.

  • dewiki has 38629715 rows referring to objects and 131576382 rows referring to non-objects. The non-object rows don't need to be touched for T181555.
  • enwiki has 176179047 rows referring to objects and 645699015 rows referring to non-objects.

Keep in mind, though, that multiple rows can refer to the same compressed object, and most of these objects probably are compressed objects, so the actual number of distinct objects is smaller. On dewiki, there are 25444940 rows referring to just 2792521 objects on rc1.

In both cases, all the rows referring to the rc1 cluster ('blobs' table on es1012 and its mirrors) are objects. Objects were also located on cluster3, cluster4, and cluster5, along with non-object rows. None of the other clusters for enwiki or dewiki had referenced objects.
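
For context, a breakdown like the above can be pulled from each wiki's text table. This is only a sketch of such a query (not necessarily the exact one used for the numbers above), relying on the usual 'DB://cluster/id[/hash]' form of old_text and the flags in old_flags:

-- Sketch: count external text rows per ES cluster, split into object vs.
-- non-object references (a third path component, or an "object" flag).
SELECT
    SUBSTRING_INDEX(SUBSTR(old_text, 6), '/', 1) AS cluster,
    (old_text LIKE 'DB://%/%/%' OR old_flags LIKE '%object%') AS isobject,
    COUNT(*) AS row_count
FROM text
WHERE old_flags LIKE '%external%'
GROUP BY cluster, isobject;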

I would personally like to evaluate the usage and implementation of more modern compression at the same time (e.g. let the storage automatically compress revisions from the same page), for example RocksDB, by Facebook, and simplify what MediaWiki does with it (compression, consolidation).

We have two methods of compression of a series of revisions of a page already available on the MediaWiki side:

  • Concatenating the revisions then gzipping.
  • Calculating binary diffs between successive revisions, then gzipping the base revision plus the binary diffs. Actually using this one for new compression currently requires installing a PHP extension and the C library it needs.

But new revisions currently can't make use of these compression methods as they're being written. The current model seems like it would work something like this:

  • Create new active ES clusters for MediaWiki to write into.
  • Run trackBlobs.php on the formerly-active ES clusters.
  • Maybe create a new "compressed" ES cluster.
  • Run recompressTracked.php to compress the blobs from the formerly-active ES clusters into the "compressed" cluster.
  • Decommission the formerly-active ES clusters. T106388 might block this step, though.

@tstarling would know better about that, though.

On the plus side, it looks like that process would indeed have the side effect of cleaning up all the objects in ExternalStore. So that could be a third option. The full process in that case would be:

  • Rows in the archive table from before MediaWiki 1.5 that are still storing text in the ar_text column need to have that text migrated into ExternalStore.
  • Rows in the text table that are storing objects need to have that data migrated to ExternalStore, and rows referring to external objects need to have their syntax updated.
    • That would be the code currently written for T181555 with a "don't modify ES blobs" option.
    • This should let us finally drop the 'cur' table that still exists on enwiki and some other older wikis.
  • Optionally, we could also migrate any rows in the text table that are storing a nontrivial amount of text into ExternalStore, if there are any.
    • That would be the existing moveToExternal.php script.
  • Then run trackBlobs.php and recompressTracked.php to recompress all the existing ES clusters into new ES clusters, either all at once or in batches.

@Anomie Apologies if I didn't express myself clearly. It is not 100% clear to me whether you understood my proposal, so apologies if you did. With "evaluate the usage and implementation of a more modern compression at the same time" I don't mean changing MediaWiki code, but rather dropping support for some of the formats you mention at the MediaWiki/application level and letting the storage server handle that transparently, so that the code is simplified (and the old formats, which seem to be relatively complex and diverse, get rewritten). The idea is that MediaWiki only (*) has to do an INSERT, letting the protocol and storage engine store it however they prefer, and retrieval becomes a simple SELECT. Compression and performance have improved a lot on storage backends in the last 10 years, and even if some tuning will almost surely be needed (storing similar pages together for better optimization), simplifying the ES code could be a huge win. I have not tested any of this, so I only want to evaluate whether it is viable.

My main reason to do it (again, stress on evaluate first) would be to simplify code maintenance: we depend a lot on Tim's knowledge, and personally I would like to simplify our most obscure pieces of code by substituting them with relatively standard infrastructure services, open source but maintained by reliable third parties (think of the conversations I had about the MediaWiki load balancer vs. potentially introducing a proxy). In this case the change wouldn't be that large: we would keep using MySQL (same interface), just with a specialized append-only engine (RocksDB). This is only an idea, thrown into the mix to be done now or at a later time, because I thought it would be a large rewrite anyway.

(*) It is never as simple as that, but you get the idea.

@Anomie, clarification: would it be fair to say, from your first evaluation, that around 1/3 to 1/6 of all content will have to be read and rewritten for the original task, or am I misunderstanding you? Are objects and non-objects around the same average size, or are some larger than others? Apologies for my lack of knowledge; I am trying to see how many disk resources we will need for this.

Data I have right now:

  • es1 (the read-only hosts): 11TB available, 7.8T in use.
  • es2 and es3 (read-write): 11TB available, 5.25T in use each; growth of around 1.2TB per year each.

"storing similar pages together for better optimization" would be the large code rewrite, since ExternalStore only gets the text without any context. Some of the work for passing context has already been done in the code for MCR, but we'd still have to replace ExternalStore with something that took advantage of that data and then migrate all the existing data (probably using trackBlobs.php or something similar). I'm not opposed to that idea, although I worry that we'll wind up having to support three methods of storing content instead of replacing ExternalStore with this new thing.

clarification: would it be fair to say, from your first evaluation, that around 1/3 to 1/6 of all content will have to be read and rewritten for the original task, or am I misunderstanding you? Are objects and non-objects around the same average size, or are some larger than others? Apologies for my lack of knowledge; I am trying to see how many disk resources we will need for this.

1/3 seems high, at least by number of rows. A bit over 1/5 of text rows are referencing objects in the two wikis I checked. But in the one sample I checked there were only 1/10 as many objects as object-referencing rows. So as a count of objects versus non-objects actually in ExternalStore it might be as low as 1/50. On the other hand, it's almost certain that the rc1 cluster is 100% objects, and that clusters beyond "cluster5" (including your es2 and es3) probably contain no objects.

As for size of objects and non-objects in ExternalStore, an individual compressed object will probably be bigger than an individual non-object, since each one contains up to 100 non-object texts (although, on dewiki rc1, the average is apparently closer to 10 texts per object).

If we want to reduce the guessing, any object in ES should match blob_text RLIKE '^O:[12][0-9]:"', while it's pretty unlikely that a non-object will match that. I don't know how many rows we have in ES that are not referenced, though, beyond the fact that comments in db-eqiad.php hint that cluster1 and cluster2 are probably largely unreferenced. If you want a list of all the referenced ES blobs on a wiki with their object/non-object status, this would do it (but it will probably be extremely slow and return a huge result set):

SELECT DISTINCT
    SUBSTRING_INDEX(txt, '/', 1) AS cluster,
    CAST(SUBSTRING_INDEX(txt, '/', -1) AS INTEGER) AS blob_id,
    isobject
FROM (
    SELECT
        SUBSTRING_INDEX(SUBSTR(old_text, 6), '/', 2) AS txt,
        (old_text LIKE 'DB://%/%/%' OR old_flags LIKE '%object%') AS isobject
    FROM text
    WHERE old_flags LIKE '%external%'
) AS tmp;
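
As a rough sanity check from the ES side itself, the pattern above could also be used to count object-looking blobs in a single cluster's table directly. This is a sketch only: "blobs_cluster3" stands in for whichever table backs the cluster being checked, it assumes the usual blob_id/blob_text layout, and it scans the whole table, so it will be slow:

-- Sketch: rough count of serialized-object blobs in one ES table, using the
-- serialization-prefix pattern above. "blobs_cluster3" is a placeholder.
SELECT
    SUM(blob_text RLIKE '^O:[12][0-9]:"') AS object_blobs,
    COUNT(*) AS total_blobs
FROM blobs_cluster3;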

To go the other way, you can see if an ES blob is referenced by checking for old_text = 'DB://$cluster/$id' OR old_text LIKE 'DB://$cluster/$id/%'.
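
For a single blob, that check might look something like this (a sketch; the cluster name and blob ID are placeholders):

-- Sketch: is ES blob 12345 on "cluster3" referenced from this wiki's text
-- table? Cluster name and blob ID are placeholders for real values.
SELECT COUNT(*) AS refs
FROM text
WHERE old_text = 'DB://cluster3/12345'
   OR old_text LIKE 'DB://cluster3/12345/%';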

Following on from that, and from my data, do you think we could fit the rows we have to rewrite on the "old" servers or on the new ones, or do you not know yet? I can run a long-running query on codfw (and so could you!), as it currently receives no traffic.

Hmm. 11TB available, 7.8T used, leaves 3.2T free, which is room to resave about 40% of the data. On dewiki only about 6% of the rows on es1 would need to be resaved. That doesn't tell us what percentage of the actual data would need to be resaved, and I'm assuming dewiki's 6% is representative, but it seems unlikely that that 6% of rows accounts for 40% of the data. I'll run something to try to do a more accurate count, but it'll probably take a while.

On the other hand, if we do want to rework the architecture to let the DB handle the compression, we shouldn't waste time resaving all the ES rows now when we'll just have to migrate them into the new architecture once that's ready.

This work will probably also result in a small spike in usage on es2 and es3 unrelated to this specific task, since we'll be migrating old archive and cur table rows into ExternalStore. Across all wikis, there appears to be 810M of non-gzipped text in archive (which will be gzipped when inserted into ES), 1254M of already-gzipped text in archive, and 8830M of non-gzipped text in the cur table (although I don't know how much of that is actually referenced and so needs to be migrated). There's also an unknown (but probably not very big, maybe even 0) amount of space that will be used for resaving any non-HistoryBlobCurStub objects from the wikis' text tables into ES.
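
(For what it's worth, a per-wiki estimate along those lines can be sketched with something like the query below. It assumes the pre-1.5-style ar_text/ar_flags columns are still present, and it is illustrative rather than the exact query used for the numbers above.)

-- Sketch: estimate how much archive-table text is still stored inline,
-- split by whether it is already gzipped (per the ar_flags column).
SELECT
    (ar_flags LIKE '%gzip%') AS is_gzipped,
    COUNT(*) AS row_count,
    SUM(LENGTH(ar_text)) AS total_bytes
FROM archive
WHERE ar_text IS NOT NULL AND ar_text != ''
GROUP BY is_gzipped;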

I'll run something to try to do a more accurate count, but it'll probably take a while.

Counting external objects and external references to concatenated storage on all wikis (in all.dblist), it looks like there are 68284770 blobs to move with a total (stored) size of 247.46G (or 265.71G if you use 10^9 rather than 2^30 for G). The code I ran to determine that took 29 hours 20 minutes, BTW, so not as long as I thought it would take.

If you want to see the breakdown by wiki, see /home/anomie/blobsizes.txt on wasat. testwiki has 21444 blobs for 15429294 bytes, and one reference to a nonexistent cluster that caused blobsizes.txt to not have data for testwiki.

That seems not too bad- I assume it takes so much time because you run it serially, which is preferable to avoid overloading server resources for background tasks like this one.

When do you want to run this? I would like to have at least a one-time backup of all of ES, or to stop replication on one server of each kind, to avoid data loss. Note also that the read-only servers are not linked through replication, so any change made there would not be replicated. I would prefer, however, that newly created blobs (even if they existed in another form) be created on the hosts of the current active clusters, even if in separate clusters - would that work? We can mark the older rows as to-be-deleted (e.g. we can generate a list of orphan blobs) and only delete them once we have checked there is no data loss - does that seem reasonable/doable? Basically, perform only inserts, and no DELETEs/UPDATEs, until we know 100% that everything went well.

I have no planned time yet. First https://gerrit.wikimedia.org/r/#/c/397632/ needs to be reviewed and merged.

I've already updated the code to make it possible to write to a new ES cluster instead of trying to update the old one, so that's no problem. Using that option, there shouldn't be any need to make a backup since there should be no writes to ES. We may want to make a new ES cluster for this script to write to instead of having it write to the current active clusters, just to keep these old blobs separate from the new blobs being added by live edits.

If we want to go back later and delete the obsoleted blobs from the existing read-only clusters, I'll have to update the script to log the obsoleted blobs to a file. That's also no problem; I just made that change.

So, to summarize the current plan as I understand it:

  1. Wait for https://gerrit.wikimedia.org/r/#/c/397632/ to be merged.
  2. Create a new active ES cluster for this maintenance script to write to.
    1. Create a new "blobs_foo" table for each wiki, on es1011 or es1014 or some other appropriate server (see the table sketch after this list).
    2. Add reference to it here and here, but NOT here.
  3. Run the maintenance script over each wiki, telling it to write into the cluster from step 2 and log the obsoleted blobs.
  4. (Maybe) copy that cluster from step 2 over to es1012/es1016/es1018/es2011/es2012/es2013, update the configuration to reference it there, and drop it from es1011 or wherever it currently is.
  5. (Maybe) determine there was no data loss and that nothing references the obsoleted blobs, then delete the obsoleted rows from es1012/es1016/es1018/es2011/es2012/es2013.
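
For step 2.1, the new table would presumably mirror the minimal layout of the existing ExternalStore blobs tables (an auto-incrementing ID plus the blob itself). A sketch only; "blobs_foo" is the placeholder name from the list above, and the exact column types, engine, and partitioning options would be up to the DBAs:

-- Sketch: a new ES cluster table for the migrated blobs. Assumes the usual
-- two-column blobs layout; details to be decided when actually creating it.
CREATE TABLE blobs_foo (
    blob_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    blob_text LONGBLOB NOT NULL,
    PRIMARY KEY (blob_id)
) ENGINE=InnoDB;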

That looks OK to me (config has to be added to codfw in parallel, but otherwise it is fine); please keep me updated when you are ready to start.

Ok, since we've determined how to do it I'm going to resolve this task. I copied the plan into the description on the parent task T181555.