There are two places in revision storage where PHP serialization is currently used:
- Revision::decompressRevisionText()'s object flag, used for generic compression and as a hack to avoid migrating data from the MediaWiki 1.4 schema to the schema introduced in 1.5.
- ExternalStoreDB can store objects that hold more than one revision, as a compression mechanism.
Removing PHP serialization from these places means finding the relevant rows and re-saving them in a format that does not require it. That in turn requires identifying the different kinds of objects that can be present and deciding how to replace each of them.
It seems the HistoryBlob interface is intended to be implemented by the objects used in both of these places. Implementing classes include:
- ConcatenatedGzipHistoryBlob: Effectively an associative array mapping an md5 hash to content, which is then serialized and gzipped.
- DiffHistoryBlob: A base version and an array of successive diffs used to generate subsequent versions, which is then serialized and gzipped.
- HistoryBlobStub: A wrapper for one of the above.
- HistoryBlobCurStub: Accesses the MW 1.4 cur table.
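To make the two compressed shapes concrete, here is a toy model in Python. It is only a sketch: the class and function names are hypothetical, the diff format is invented for illustration, and none of it reflects MediaWiki's actual classes or on-disk format. It assumes the md5 key is the hash of the content itself.

```python
import difflib
import hashlib


class ConcatenatedBlob:
    """Toy model: several revision texts keyed by the md5 hash of their content."""

    def __init__(self):
        self.items = {}

    def add_item(self, text: str) -> str:
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        self.items[key] = text  # identical texts collapse into one entry
        return key

    def get_item(self, key: str) -> str:
        return self.items[key]


def make_diff(old: str, new: str) -> list:
    """Describe `new` as copies from `old` plus literal insertions."""
    ops = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": keep the new text literally
            ops.append(("insert", new[j1:j2]))
    return ops


def apply_diff(old: str, ops: list) -> str:
    """Rebuild the newer text from the older one and a list of ops."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


class DiffBlob:
    """Toy model: a base version plus successive diffs."""

    def __init__(self, base: str):
        self.base = base
        self.latest = base
        self.diffs = []

    def add_item(self, text: str) -> int:
        self.diffs.append(make_diff(self.latest, text))
        self.latest = text
        return len(self.diffs)  # index of the new version; 0 is the base

    def get_item(self, index: int) -> str:
        text = self.base
        for ops in self.diffs[:index]:  # replay diffs up to the wanted version
            text = apply_diff(text, ops)
        return text
```

In this model a stub would only need to remember which blob to load and which key or index to ask it for.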
Replacement will involve either loading the data and re-saving it without fancy compression, or figuring out a method to save and restore the compression without using PHP's serialization.
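The two options might look roughly like the following Python sketch. The function names are hypothetical, JSON plus zlib is just one possible language-neutral encoding, and the output is not claimed to be byte-compatible with anything PHP produces. It also assumes every text is valid Unicode, which may not hold for very old rows.

```python
import json
import zlib


def explode_blob(items: dict) -> list:
    """Option 1: give up grouping and return one plain-text row per revision."""
    return [(key, text.encode("utf-8")) for key, text in sorted(items.items())]


def encode_blob(items: dict) -> bytes:
    """Option 2: keep the grouping, but encode the map as compressed JSON."""
    payload = json.dumps(items, ensure_ascii=False, sort_keys=True)
    return zlib.compress(payload.encode("utf-8"))


def decode_blob(data: bytes) -> dict:
    """Inverse of encode_blob()."""
    return json.loads(zlib.decompress(data).decode("utf-8"))
```

The first option is simpler to read back but loses the space savings from compressing similar revisions together. The second keeps them, at the cost of defining and maintaining a new container format.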
The plan for deployment on WMF wikis (from subtask T183419) is:
1. Create a new active ES cluster for the maintenance script to write to.
2. Run the maintenance script over each wiki, telling it to write into the cluster from step 1 and log the obsoleted blobs.
3. (Maybe) copy the cluster from step 1 over to es1012/es1016/es1018/es2011/es2012/es2013, update the configuration to reference it there, and drop it from es1011 or wherever it currently is.
4. (Maybe) determine there was no data loss and that nothing references the obsoleted blobs, then delete the obsoleted rows from es1012/es1016/es1018/es2011/es2012/es2013.
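The core loop of such a maintenance script could be sketched as below. This is an illustration only, not the actual script: `migrate`, the address strings, and the in-memory dict standing in for the new cluster are all hypothetical, and the input is assumed to be pairs of (old address, already-decoded text).

```python
import logging

log = logging.getLogger("blob-migration")


def migrate(rows, new_cluster, cluster_name="hypothetical-cluster"):
    """Re-save each text into new_cluster and record which old blobs it replaces.

    rows        -- iterable of (old_address, text) pairs, decoded by the old code
    new_cluster -- dict standing in for the new storage cluster
    Returns (remap, obsoleted): old -> new addresses, and the old addresses
    that are no longer needed once every reference has been updated.
    """
    remap = {}
    obsoleted = []
    for old_address, text in rows:
        new_address = f"{cluster_name}/{len(new_cluster) + 1}"
        new_cluster[new_address] = text.encode("utf-8")

        # Read the row back before trusting it, so data loss is caught early.
        if new_cluster[new_address].decode("utf-8") != text:
            raise RuntimeError(f"round-trip mismatch for {old_address}")

        remap[old_address] = new_address
        if old_address not in obsoleted:
            obsoleted.append(old_address)
            log.info("obsoleted blob: %s -> %s", old_address, new_address)
    return remap, obsoleted
```

Keeping the list of obsoleted addresses separate from the rewrite itself is what allows the final deletion to be deferred until the no-data-loss check has passed.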