We have a need to track several bits of data per *title*, independent of revisions corresponding to this title:
- Page deletions and other protection information. A page corresponding to a given title can be deleted many times, which a versioned page table can capture.
- The title's rename history (so that we can reconstruct linear histories), possibly in the form of renamed_from and renamed_to fields.
- The MediaWiki page_id and other bits from the page table. Note that a single page title can map to multiple page ids over time if the content was deleted & re-created or restored, or renamed to & from other titles. Again, a versioned page table can capture this with one row per version.
- Efficient and (ideally) ordered listings of all titles (T89564). The order requirement is actually not so trivial to support for large data sets without more secondary index work, so maybe worth deferring?
- Possibly, at a point in the (far) future, a synchronization point for atomic moves and renames.
We should start to support this properly in RESTBase. We should probably expose this information at /page/title/{title}. This will also give us a natural resource path for page-related events like page creation or deletion.
Logically we can then check whether a page is deleted on each revision access. This would bring the number of queries per revision request to three for old revisions (one additional check for revision deletion), and two for new revisions. In local testing, an extra revision metadata request currently roughly halves throughput, so there is a big advantage in retrieving all protection information in a single request. We might be able to avoid the extra page metadata request by also storing page deletion information in a static column in each key_rev_value bucket. The static column is shared between all revisions of a given domain & title, so only needs to be updated once on page deletion, for each content type. Static columns can be added & removed with a schema upgrade. In order to keep old revisions of a page hidden while allowing new content creation at the same title, we might need to store a timestamp or timeuuid indicating the time before which content should be suppressed. If we also denormalized revision deletion information per-row, then we could reliably load all suppression information in a single I/O.
Strawman design (WIP)
Page table schema (yaml syntax):
attributes:
title: string
event: string
good_after: timeuuid
page_id: int
# more attributes
tid: timeuuid
index:
- type: hash
attribute: title
- type: static
attribute: good_after
- type: range
attribute: tid
order: descNotes:
- one row per change with tid; *all* changes (including deletions or renames) create new rows with new tids
- event can be "creation", "move_to:<title>", "move_from:<title>","deletion"; alternatively, we could consider separate rows or a map; a rename will insert two rows at with the same tid.
- The handling of deletions / restrictions likely isn't quite right yet; should go through restriction & query use cases to figure that out:
- Page is deleted (currently: last event is "deletion"): all revisions with this title should be hidden.
- Page is not current: Last event is not "creation" or "move_from:*" (not nice to check this!). Queries asking for the 'latest' revision using this title shouldn't return anything, but old revisions can still be retrieved (up until the next deleted entry in the page table, which matches good_after).