MediaWiki currently stores the entire page content as WikiText. In addition to WikiText, we would like to store:
- The fully expanded HTML DOM
- Page properties: categories, magic word flags (notoc, etc.), DISPLAYTITLE, bug 48812, etc.
- Parsoid-internal information: essentially the data-parsoid attributes, moved out of the main page DOM
Eventually we'd also like to be able to drop WikiText storage without having to rework the storage architecture.
In the current MediaWiki external storage and ContentHandler architecture, this can be achieved by adding a multi-part content type with a corresponding ContentHandler. The compound document could be serialized as a JSON object or in some other format.
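As a rough illustration of the JSON option, the sketch below bundles the parts listed above into one compound document and round-trips it through JSON. The field names (`wikitext`, `html`, `page_props`, `data_parsoid`) and the shape of each part are assumptions made for this example, not an existing MediaWiki or Parsoid format, and the sketch is in Python rather than as an actual ContentHandler.

```python
import json


def make_compound_document(wikitext, html, page_props, data_parsoid):
    """Bundle all parts of one revision into a single dict.

    The key names are hypothetical; they only mirror the parts listed above.
    """
    return {
        "wikitext": wikitext,
        "html": html,
        "page_props": page_props,
        "data_parsoid": data_parsoid,
    }


def serialize(doc):
    """Serialize the compound document to a JSON string for storage."""
    return json.dumps(doc, sort_keys=True)


def deserialize(blob):
    """Parse a stored JSON string back into a compound document."""
    return json.loads(blob)


doc = make_compound_document(
    wikitext="__NOTOC__\n[[Category:Examples]]\nHello ''world''",
    html="<p>Hello <i>world</i></p>",
    page_props={"categories": ["Examples"], "notoc": True, "displaytitle": None},
    data_parsoid={},  # placeholder; the real structure is Parsoid-internal
)
blob = serialize(doc)
```

Because each part lives under its own key, the `wikitext` part could later be omitted without changing how the other parts are stored or read.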
A possible downside of the compound document approach stems from the need to update transclusion or image expansions for a given revision. With append-only, immutable external storage, this can be implemented by storing a new compound document and then updating the revision to point to it. Without garbage collection, this leaves several copies of the unmodified WikiText and page properties in external storage. However, this issue should probably be addressed in the storage layer.
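The duplication can be seen in a minimal sketch of the update path, assuming a toy in-memory append-only store and a revision table that maps revision ids to blob addresses. `AppendOnlyStore`, `save_revision`, `load_revision`, and `refresh_html` are hypothetical names for this example and do not correspond to MediaWiki's external storage API.

```python
import json


class AppendOnlyStore:
    """Toy append-only blob store: blobs are never modified or deleted."""

    def __init__(self):
        self._blobs = []

    def put(self, blob):
        self._blobs.append(blob)
        return len(self._blobs) - 1  # the address of the new blob

    def get(self, address):
        return self._blobs[address]

    def __len__(self):
        return len(self._blobs)


store = AppendOnlyStore()
revisions = {}  # revision id -> address of its current compound document


def save_revision(rev_id, doc):
    """Store a compound document and point the revision at it."""
    revisions[rev_id] = store.put(json.dumps(doc, sort_keys=True))


def load_revision(rev_id):
    return json.loads(store.get(revisions[rev_id]))


def refresh_html(rev_id, new_html):
    """Re-expand a revision: write a whole new document, then repoint."""
    doc = load_revision(rev_id)
    doc["html"] = new_html
    save_revision(rev_id, doc)


save_revision(1, {
    "wikitext": "{{Greeting}}",
    "page_props": {"categories": []},
    "html": "<p>Hello</p>",
})
refresh_html(1, "<p>Hello again</p>")  # e.g. the Greeting template changed

# Count how many stored blobs carry the same, unmodified WikiText.
copies = sum(
    json.loads(store.get(address))["wikitext"] == "{{Greeting}}"
    for address in range(len(store))
)
```

After one refresh, the store holds two blobs that both contain the same WikiText and page properties, while the revision points only at the newer one. The older blob is the garbage that the storage layer would need to collect or deduplicate.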
Note: This is now being addressed with RESTBase.