This RFC proposes a modified version of {T105652}, where a new table for storing content meta-data is added instead of adding columns to the existing `page`, `revision`, and `archive` tables. This would be a first step towards allowing multiple content objects (slots, streams) per revision, as per {T107595}. This was originally submitted as a belated comment to T105652. Filing it as a separate RFC was done to allow it to be discussed as an alternative T105652, though it is really just a modification of the original RFC by @Legoktm.
Two variations are proposed: a minimal one which only includes the columns in the new `content` table that are needed to represent content model and format, and a more comprehensive version, that also introduces a field for the content "role" (slot or stream name), as needed for multi-content-revision support. Further meta-data fields, like length and hash, could then also be moved to the new table right away.
Rationale: While the needs that drive T105652 and T107595 are quite different, their solutions overlap significantly, namely changing the way content meta-data is stored in the `page`, `revision`, and `archive` tables. The idea behind the this RFC is to kill two birds with one stone: allow for more compact storage of content model and format as well as add an indirection between content object and revision. If T105652 and T107595 were implemented independently, their implementations would interfere or may even contradict each other.
The driving consideration is that schema changes are disruptive - not only do they need work by database engineers and updates the MediaWiki core's code, they also mean a breaking change for extensions, tools running on #labs, as well as for any consumers of the raw SQL dumps we provide for download.
-----
Details (originally posted on [[https://www.mediawiki.org/wiki/Topic:T92zmz723prcz2pu|Talk:Requests_for_comment/Content_model_storage]], see also comments there):
Instead of adding columns to various tables (page, revision, archive) as per T105652 as originally approved, I suggest to create a separate table that holds meta-data about revision content (at least the model and format, but we will also want things like the role/slot, blob address, and hash there later, for multi-content-revision support). The table would have at least these fields:
```
lang=sql
CREATE TABLE /*_*/content (
cont_revision INT UNSIGNED NOT NULL,
cont_model SMALLINT NOT NULL,
cont_format SMALLINT NOT NULL,
PRIMARY KEY (cont_revision)
) /*$wgDBTableOptions*/;
```
This table can then be used to acquire the model and format for a given revision by joining `cont_revision` against `page_current`, `rev_id`, or `ar_rev_id`.
This is proposed as the //basic// version of this RFC.
-----
If we want to support multiple content "slots" per revision (as per T107595), `cont_revision` would no longer be sufficient to identify the desired content. A `cont_role` field would be added to identify the role the content plays in the revision (e.g. main, style, categories, meta, blame, etc). `cont_role` would reference a content_role table defined in the same way as content_model and content_format. `cont_revision` and `cont_role` form a unique key. The table would then look like this:
```
lang=sql
CREATE TABLE /*_*/content (
cont_id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,
cont_revision INT UNSIGNED NOT NULL,
cont_role SMALLINT NOT NULL,
cont_model SMALLINT NOT NULL,
cont_format SMALLINT NOT NULL,
-- more fields to add for multi-content-revision support:
-- cont_address, cont_hash, cont_logical_size, cont_is_primary, etc
PRIMARY KEY (cont_revision, cont_role)
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/cont_revision_role ON /*_*/content_role (cont_revision, cont_role);
CREATE TABLE /*_*/content_role (
cr_id SMALLINT NOT NULL PRIMARY KEY AUTO_INCREMENT,
cr_role VARBINARY(32) NOT NULL
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/cr_role ON /*_*/content_role (cr_role);
```
When joining against `page_current`, `rev_id`, etc., `cont_role` will then have to be fixed (e.g. to the "main" role) to allow a unique match per revision.
As an alternative to the composite primary keyThe auto-increment `cont_id` field is not strictly necessary, we can also have an auto-increment `cont_id` field as the PRIMARY KEY,but being able to identify specific content with a unique id seems like a good idea for the future. and define a UNIQUE KEY on `(cont_revisionFor instance, cont_role)`it will allow us to re-use the same `content` row for multiple revisions by adding another table for relating revisions to content.
The definition of the `content_role` table given here is in accordance with the definition of the `content_model` and `content_format` tables suggested in T105652. All of these could be changed to use an UNSIGNED SMALLINT, and to enforce uniqueness of the readable name using a UNIQUE INDEX.
This is proposed as the //medium// version of this RFC.
-----
If we want to migrate all relevant content meta-data to the new table in one go, the following additional fields need to be defined in the `content` table:
* `cont_address VARBINARY(255) NOT NULL`: the content blob address, populated from `CONCAT("tt:", rev_text_id)`, where the "tt" prefix indicates that the blob is stored in the text table.
* `cont_logical_size INT(10) UNSIGNED`: the logical sizes of the Content object, according to Content::getSize(), populated from `rev_len`.
* `cont_hash VARBINARY(32) NOT NULL`: the hash of the serialized content, populated from `rev_sha1`.
* `cont_is_primary TINYINT DEFAULT 1`: whether this is primary, user generated content. Always 1 for now.
* `cont_deleted TINYINT DEFAULT 0`: whether this content was suppressed, populated from `rev_deleted & 1`.
Adding these fields is not necessary to address the original purposed of T105652. The can be added and populated at a later time, as a further step towards T105652. Moving information from `revision` into `content` needs some work on the PHP side, but nothing as disruptive as the indirections needed for supporting the `content` table and the tables for content model, format, and role.
This is proposed as the //extended// version of this RFC.
-----
As @Anomie pointed out, this is an opportunity to migrate the `archive` table to using the text table, removing the need to move content meta-information between `revision` and `archive`:
* `ar_text` and `ar_flags` would go into the `text` table
* an entry for `ar_rev_id` with the necessarya meta-data would be created in the `content` table
* `ar_text` and `ar_flags` can be dropped.
We will then have to make sure we can distinguish between suppressed and deleted content in `cont_deleted`.
This is proposed as a //follow-up// to this RFC.