
Remove use of PHP serialization in revision storage
Open, Normal · Public

Description

There are two places in revision storage where PHP serialization is currently used:

  • Revision::decompressRevisionText()'s object flag, used for generic compression and as a hack to avoid migrating data from the MediaWiki 1.4 schema to the schema introduced in 1.5.
  • ExternalStoreDB can store objects that hold more than one revision, as a compression mechanism.

Removing the use of PHP serialization in these places will involve finding the relevant rows and re-saving them in a format that doesn't require PHP serialization. That in turn will require identifying the different kinds of objects that can be present and figuring out how to replace each of them.

It seems the HistoryBlob interface is intended to be implemented by the objects used in both of these places. Implementing classes include:

  • ConcatenatedGzipHistoryBlob: Effectively an associative array mapping an md5 hash to content, which is then serialized and gzipped.
  • DiffHistoryBlob: A base version and an array of successive diffs used to generate subsequent versions, which is then serialized and gzipped.
  • HistoryBlobStub: A wrapper for one of the above.
  • HistoryBlobCurStub: Accesses the MW 1.4 cur table.

Replacement will involve either loading the data and re-saving it without fancy compression, or figuring out a method to save and restore the compression without using PHP's serialization.
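To illustrate why these classes exist at all, here is a minimal Python sketch (not MediaWiki code) comparing per-revision compression with the "several revisions in one compressed object" approach described above. The sample revisions are made up, and zlib stands in for whatever compression the real classes use; only the md5-keyed map mirrors the description of ConcatenatedGzipHistoryBlob.

```python
import hashlib
import zlib

# Hypothetical data: ten near-identical revisions of one page.
base = "MediaWiki stores every revision of a page as a text blob. " * 20
revisions = [base + f"Edit number {i}." for i in range(10)]

# Option A: compress each revision on its own and add up the sizes.
separate = sum(len(zlib.compress(r.encode("utf-8"))) for r in revisions)

# Option B: key each revision by its md5 hash, concatenate, compress once.
by_hash = {hashlib.md5(r.encode("utf-8")).hexdigest(): r for r in revisions}
combined = len(zlib.compress("".join(by_hash.values()).encode("utf-8")))

# Compressed size in bytes for each option.
print(separate, combined)
```

Because consecutive revisions share most of their text, the single combined stream comes out smaller than the sum of the individually compressed revisions. That saving is what "re-saving without fancy compression" would give up.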


The plan for deployment on WMF wikis (from subtask T183419) is:

  1. Create a new active ES cluster for the maintenance script to write to.
    1. Create a new "blobs_foo" table for each wiki, on es1011 or es1014 or some other appropriate server at DBA discretion.
    2. Add reference to it here and here, but NOT here, and the equivalent for codfw.
  2. Run the maintenance script over each wiki, telling it to write into the cluster from step 1 and log the obsoleted blobs.
  3. (Maybe) copy that cluster from step 1 over to es1012/es1016/es1018/es2011/es2012/es2013, update the configuration to reference it there, and drop it from es1011 or wherever it currently is.
  4. (Maybe) determine there was no data loss and that nothing references the obsoleted blobs, then delete the obsoleted rows from es1012/es1016/es1018/es2011/es2012/es2013.

Event Timeline

Anomie created this task. Nov 28 2017, 7:13 PM
Restricted Application added a subscriber: Aklapper. Nov 28 2017, 7:13 PM
Anomie updated the task description. Nov 28 2017, 7:17 PM

Tentative plan:

  • Write a class that will turn a PHP string[] into a blob without using serialize()/unserialize(). This will probably be a binary format: the body is all the strings concatenated and compressed, and the header is either a list of offsets or a list of keys (as length-prefixed strings) with their offsets. There will probably also be another header as a JSON object.
  • Use this class to reimplement ConcatenatedGzipHistoryBlob and DiffHistoryBlob where they internally use PHP serialization.
  • Write a maintenance script or two to migrate existing data.
    • ConcatenatedGzipHistoryBlob and DiffHistoryBlob → The reimplementations, tagged somehow.
    • HistoryBlobCurStub → Re-save using the modern mechanism.
    • HistoryBlobStub → Error, tell the user to run maintenance/storage/resolveStubs.php first.
    • Anything else → Re-save using the modern mechanism.
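The binary format from the first bullet could look roughly like the Python sketch below. This is only an illustration of the idea, not the format used in the patch: the field widths (4-byte big-endian integers), the JSON header contents, the use of zlib, and the names pack_strings/unpack_strings are all assumptions made for the example.

```python
import json
import struct
import zlib


def pack_strings(items):
    """Pack a key -> string map into one blob without language-specific serialization.

    Assumed layout: [4-byte JSON length][JSON header][index][compressed body].
    """
    index = b""
    body = b""
    for key, text in items.items():
        raw_key = key.encode("utf-8")
        raw_text = text.encode("utf-8")
        # Index entry: length-prefixed key, then byte offset and length in the body.
        index += struct.pack(">I", len(raw_key)) + raw_key
        index += struct.pack(">II", len(body), len(raw_text))
        body += raw_text
    meta = json.dumps({"version": 1, "count": len(items)}).encode("utf-8")
    return struct.pack(">I", len(meta)) + meta + index + zlib.compress(body)


def unpack_strings(blob):
    """Inverse of pack_strings: rebuild the key -> string map."""
    (meta_len,) = struct.unpack_from(">I", blob, 0)
    pos = 4
    meta = json.loads(blob[pos:pos + meta_len])
    pos += meta_len
    entries = []
    for _ in range(meta["count"]):
        (key_len,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        key = blob[pos:pos + key_len].decode("utf-8")
        pos += key_len
        offset, length = struct.unpack_from(">II", blob, pos)
        pos += 8
        entries.append((key, offset, length))
    # Everything after the index is the compressed, concatenated body.
    body = zlib.decompress(blob[pos:])
    return {k: body[o:o + n].decode("utf-8") for k, o, n in entries}
```

Keeping the index outside the compressed body means a reader can locate an entry before decompressing anything, and none of the fields require executing a deserializer on untrusted data.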

Before actually running these scripts, though, I'll probably do a dry run to see if there are any "anything else" and revise the plan to include them.
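A dry run of this kind could be as simple as tallying which action the plan assigns to each stored object and listing any class the plan does not name. The Python sketch below assumes the class name of each stored object is already available as a string; how it would be read from the actual rows is not shown, and the function name dry_run is hypothetical.

```python
from collections import Counter

# Planned action per stored object class, taken from the list above.
PLAN = {
    "ConcatenatedGzipHistoryBlob": "reimplement",
    "DiffHistoryBlob": "reimplement",
    "HistoryBlobCurStub": "resave",
    "HistoryBlobStub": "error: run maintenance/storage/resolveStubs.php first",
}


def dry_run(class_names):
    """Tally planned actions and collect classes the plan does not name."""
    actions = Counter()
    unknown = Counter()
    for name in class_names:
        if name in PLAN:
            actions[PLAN[name]] += 1
        else:
            # "Anything else" is re-saved, but recorded so the plan can be revised.
            actions["resave"] += 1
            unknown[name] += 1
    return actions, unknown
```

A non-empty `unknown` counter is the signal to revisit the plan before migrating anything.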

Change 397632 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] WIP: Migrate HistoryBlob to alternatives without PHP serialization

https://gerrit.wikimedia.org/r/397632