Store HTML and page properties with multi-part content handler
Closed, Resolved, Public

Description

MediaWiki currently stores the entire page content as WikiText. In addition to WikiText, we would like to store:

  • The fully expanded HTML DOM
  • Page properties: categories, magic word flags (NOTOC etc.), DISPLAYTITLE, bug 48812, etc.
  • Parsoid-internal information: basically data-parsoid moved out of the main page DOM

Eventually we'd also like to be able to drop WikiText storage without having to rework the storage architecture.

In the current MediaWiki external storage and ContentHandler architecture, this can be achieved by adding a multi-part content type with a corresponding ContentHandler. The multi-part content could be serialized as a JSON object or in some other format.
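
As a rough illustration of the compound-document idea, the sketch below serializes three parts into a single JSON blob and reads them back. The part names (wikitext, html, pageprops), the function names and the sample values are assumptions made for this example; none of this is an existing ContentHandler API.

```php
<?php
// Hypothetical sketch: a multi-part document serialized as one JSON object.
// Part names and function names are illustrative, not MediaWiki API.

function serializeCompound( array $parts ): string {
	// JSON_THROW_ON_ERROR makes encoding failures explicit (PHP >= 7.3).
	return json_encode( $parts, JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES );
}

function unserializeCompound( string $blob ): array {
	// Decode to associative arrays so parts can be looked up by key.
	return json_decode( $blob, true, 512, JSON_THROW_ON_ERROR );
}

$blob = serializeCompound( [
	'wikitext' => "'''Hello'''",
	'html' => '<p><b>Hello</b></p>',
	'pageprops' => [ 'categories' => [ 'Greetings' ], 'notoc' => true ],
] );

$doc = unserializeCompound( $blob );
echo $doc['html'], "\n";
```

A real ContentHandler would wrap something like this behind its own serialization methods; the point here is only that every part travels in one blob.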

A possible downside of the compound document approach stems from the need to update transclusion or image expansions for a given revision. With append-only and immutable external storage this can be implemented by storing a new compound document and then updating the revision to point to it. Without garbage collection this will result in several copies of unmodified WikiText and page properties in external storage. However, this issue should probably be addressed in the storage layer.
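
The duplication described above can be made concrete with a small sketch. It assumes a toy append-only store and a plain array of revision pointers; AppendOnlyStore and its methods are made-up names for illustration and do not correspond to MediaWiki's external storage classes.

```php
<?php
// Hypothetical sketch of append-only external storage; class and method
// names are invented for illustration.

class AppendOnlyStore {
	/** @var string[] blobs indexed by address; existing entries are never modified */
	private array $blobs = [];

	public function append( string $blob ): int {
		$this->blobs[] = $blob;
		return count( $this->blobs ) - 1;
	}

	public function fetch( int $address ): string {
		return $this->blobs[$address];
	}

	public function size(): int {
		return count( $this->blobs );
	}
}

$store = new AppendOnlyStore();
$revisionPointers = []; // revision id => blob address

// Initial save of revision 1.
$doc = [ 'wikitext' => '{{Infobox}} text', 'html' => '<p>old render</p>' ];
$revisionPointers[1] = $store->append( json_encode( $doc ) );

// A template changed: only the HTML part is re-rendered, but the whole
// compound document is written again and the revision is repointed.
$doc['html'] = '<p>new render</p>';
$revisionPointers[1] = $store->append( json_encode( $doc ) );

// Count how many stored blobs carry the same, unmodified wikitext.
$copies = 0;
for ( $i = 0; $i < $store->size(); $i++ ) {
	$stored = json_decode( $store->fetch( $i ), true );
	if ( $stored['wikitext'] === $doc['wikitext'] ) {
		$copies++;
	}
}
echo "wikitext copies: $copies\n";
```

After one re-render the unchanged wikitext is already stored twice, which is the overhead a storage-layer fix would need to remove.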

Note: This is now being addressed with RESTBase.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=35066
https://bugzilla.wikimedia.org/show_bug.cgi?id=50882
https://bugzilla.wikimedia.org/show_bug.cgi?id=53508

Details

Reference
bz49143

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:48 AM
bzimport added a project: Parsoid.
bzimport set Reference to bz49143.

We should provide an abstract interface to retrieve parts of multi-part content, so that the storage implementation in the backend can be optimized independently. Possibly something like this:

$html = $rev->getPart( "html" );
$wikitext = $rev->getPart( "wikitext" );
$pageProps = json_decode( $rev->getPart( "pageprops" ) );

Parts can be set / updated with $rev->setPart( "key", "value" );

The backend is free to store each part independently or concatenate parts with some efficient segmentation mechanism. This part interface can be used by higher-level content handlers to implement a consistent ContentHandler interface.
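
A minimal sketch of how such a part interface might be separated from its backend is shown below. PartStore, InMemoryPartStore and RevisionParts are hypothetical names invented for this example, and the in-memory backend simply stores each part independently; a real backend could just as well concatenate parts.

```php
<?php
// Hypothetical sketch of the part interface; PartStore, InMemoryPartStore and
// RevisionParts are made-up names, not existing MediaWiki classes.

interface PartStore {
	public function load( int $revId, string $key ): ?string;
	public function save( int $revId, string $key, string $value ): void;
}

// One possible backend: each part is stored independently under "revId:key".
class InMemoryPartStore implements PartStore {
	private array $rows = [];

	public function load( int $revId, string $key ): ?string {
		return $this->rows["$revId:$key"] ?? null;
	}

	public function save( int $revId, string $key, string $value ): void {
		$this->rows["$revId:$key"] = $value;
	}
}

class RevisionParts {
	private int $revId;
	private PartStore $store;

	public function __construct( int $revId, PartStore $store ) {
		$this->revId = $revId;
		$this->store = $store;
	}

	public function getPart( string $key ): ?string {
		return $this->store->load( $this->revId, $key );
	}

	public function setPart( string $key, string $value ): void {
		$this->store->save( $this->revId, $key, $value );
	}
}

$rev = new RevisionParts( 42, new InMemoryPartStore() );
$rev->setPart( 'wikitext', '__NOTOC__ Hello' );
$rev->setPart( 'html', '<p>Hello</p>' );
$rev->setPart( 'pageprops', json_encode( [ 'notoc' => true ] ) );

$pageProps = json_decode( $rev->getPart( 'pageprops' ) );
var_dump( $pageProps->notoc );
```

Because callers only see getPart() and setPart(), the backend can be swapped without touching higher-level content handlers.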

gwicke and I seem to disagree about whether this plan involves eventually removing page properties (e.g. #REDIRECT, NOTOC) from the wikitext/DOM.

(In reply to comment #3)

gwicke and i seem to disagree whether this plan involves eventually removing
page properties (ie, #REDIRECT, NOTOC) from the wikitext/DOM or not.

This bug is primarily about the multi-part storage, but I'll reply nevertheless:

The VE page property dialog makes page-global properties easier to discover and modify. This dialog (or something very close to it) can also be used in combination with wikitext editing.

With a page property UI and diffing support for properties in place I don't see a good reason for keeping page properties both in a versioned page property structure *and* inline in the longer term.

Just a clarification for those that have not been following the discussions in the last year or so:

  • We plan to store fully expanded snapshots of HTML for each revision. This makes HTML retrieval fast, and also lets us provide a view of the page as it looked in the past, including all transclusion/extension/file dependencies. There are some storage volume trade-offs (which can likely be addressed with compression), but if we decide to store a snapshot on each re-render after a template/file change, we can provide a copy of each page as it was at any point in the past. Retrieving 'yesterday's Main Page' becomes possible. So far that has only been possible with the flagged revision extension, at considerable expense.
  • Other data structures that change with transclusion/extension/file updates will be snapshotted similarly. This applies to dynamic page properties for example.

See bug 851 (from 2004) for the "Yesterday's Main Page" problem.
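
To illustrate how per-render snapshots would answer a "page as of time T" request, here is a small sketch. It assumes snapshots are available as a map from render timestamp to HTML; that layout, the function name snapshotAt() and the sample data are all hypothetical.

```php
<?php
// Hypothetical sketch: pick the HTML snapshot that was current at a given time.
// The data layout (timestamp => HTML) and function name are assumptions.

function snapshotAt( array $snapshots, int $timestamp ): ?string {
	ksort( $snapshots ); // oldest render first
	$result = null;
	foreach ( $snapshots as $renderedAt => $html ) {
		if ( $renderedAt > $timestamp ) {
			break; // this render and all later ones happened after the requested time
		}
		$result = $html;
	}
	return $result;
}

// Renders of one page, keyed by Unix timestamp of the (re-)render.
$mainPage = [
	1385000000 => '<p>Featured: A</p>', // original save
	1385090000 => '<p>Featured: B</p>', // re-render after a template change
	1385180000 => '<p>Featured: C</p>',
];

echo snapshotAt( $mainPage, 1385100000 ), "\n";
```

The lookup returns the latest render at or before the requested time, or null if the page had not been rendered yet.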

Moving page properties to page metadata is tracked in bug 53508.

*** Bug 52796 has been marked as a duplicate of this bug. ***
Arlolra set Security to None.
GWicke claimed this task.

This is now resolved with RESTBase entering beta production. We don't store page properties like categories yet, but are well prepared to do so.