Page MenuHomePhabricator

RFC: Create a content meta-data table
Closed, ResolvedPublic

Description

This RFC proposes a modified version of T105652: RfC: Content model storage, where a new table for storing content meta-data is added instead of adding columns to the existing page, revision, and archive tables. This would be a first step towards allowing multiple content objects (slots, streams) per revision, as per T107595: [RFC] Multi-Content Revisions. This was originally submitted as a belated comment to T105652. Filing it as a separate RFC was done to allow it to be discussed as an alternative T105652, though it is really just a modification of the original RFC by @Legoktm.

Three variants are proposed for the new table: a basic one which only includes the columns in the new content table that are needed to represent content model and format, a medium variant that also introduces a field for the content "role" (slot or stream name), as needed for multi-content-revision support, and an extended version that introduces all fields needed for multi-content-revisions, like length and hash, right away.

Rationale: While the needs that drive T105652 and T107595 are quite different, their solutions overlap significantly, namely changing the way content meta-data is stored in the page, revision, and archive tables. The idea behind the this RFC is to kill two birds with one stone: allow for more compact storage of content model and format as well as add an indirection between content object and revision. If T105652 and T107595 were implemented independently, their implementations would interfere or may even contradict each other.

The driving consideration for this proposal is that schema changes are disruptive - not only do they need work by database engineers and updates the MediaWiki core's code, they also mean a breaking change for extensions, tools running on Cloud-Services, as well as for any consumers of the raw SQL dumps we provide for download.

Update 2016-09-05: The details of this RFC are now maintained as part of the MCR proposal MCR Content Meta-Data Database Schema.

Event Timeline

daniel moved this task from P1: Define to Request IRC meeting on the TechCom-RFC board.

IMO we may as well go right for the second option since it'll save having to screw around with adding cr_role and changing the primary key for T105652 and I don't think there's any doubt that we're going to do T105652. In the interim we can add a single row to the content_roles table and use that as a constant.

cnt_revision INT NOT NULL,

Data type should be INT UNSIGNED to match revision.rev_id. Wouldn't hurt to make the others unsigned too, unless you're wanting to use negative numbers for some defined purpose.

Also, maybe we should use "con" or "cont" instead of "cnt" as a prefix.

CREATE INDEX /*i*/cr_role ON /*_*/content_role (cr_role);

Personally I'd make this UNIQUE to avoid the possibility of some sort of race or database corruption resulting in two rows with the same cr_role.

-- more fields to add for multi-content-revision support:
-- cnt_address, cnt_hash, cnt_logical_size, cnt_is_primary, etc

Since these seem to be replacing certain revision table fields, it might be nice if we could define them before running the maintenance script to populate the new table instead of having to run it twice. It looks like the mappings would be

  • rev_lencnt_logical_size
  • rev_sha1cnt_hash
  • rev_text_idcnt_address (likely with some prefix added)
  • If we want per-slot revdel, possibly rev_deleted & 1cnt_deleted

For the archive table there's additionally an opportunity for translating ar_text and ar_flags to the text table system.

@Anomie Thank you, I have updated the description according to your comments.

We will then have to make sure we can distinguish between suppressed and deleted content in cont_deleted.

Content that has been revdeled would have cont_deleted set. Distinguishing between ordinary revdel and suppression could either store the flag in cont_deleted (valid values 0, 1, or 9 instead of just 0 or 1) or could rely on the suppression bit in rev_deleted or ar_deleted, depending largely on whether we want to individually suppress content or continue applying the suppression flag to the revision as a whole.

Content for deleted pages would be identified by the fact that there is no row corresponding to cont_revision in revision, rather it's in archive. This is how it works already for the text table for stuff deleted since MediaWiki 1.5.

This was discussed yesterday in E261. There was agreement for me to continue work on introducing the content table as proposed here, but no final commitment to this approach. We did not discuss which fields should be included in the initial version of the content table, though @tstarling thoughtfully suggested to include the fields needed for MCR to keep me motivated. It's still unclear how costly it is to add the relevant fields (cont_hash, etc) later. @Legoktm agreed to help with the implementation, at least by doing code review.

@jcrespo asked me to provide a detailed migration plan, and perhaps some php code for the migration script. These will be my next steps, then. I also plan to work on implementing a generic mapping mechanism for names, that can be used for the names of content models, formats, and roles.

Meeting Minutes

Logs: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-office/ (robla, 21:03:15)

  • LINK: https://phabricator.wikimedia.org/E261 (robla, 21:03:31)
  • LINK: https://phabricator.wikimedia.org/T105652 (robla, 21:03:47)
  • LINK: https://phabricator.wikimedia.org/T105652 (DanielK_WMDE_, 21:03:48)
  • LINK: https://phabricator.wikimedia.org/T142980 (DanielK_WMDE_, 21:03:54)
  • LINK: https://phabricator.wikimedia.org/T142980 (DanielK_WMDE_ 's revised proposal) (robla, 21:04:16)
  • primary question to resolve: do a) legoktm 's original T105652 b) DanielK_WMDE_ 's modification T142980 c) none of the above (stay with status quo) (robla, 21:06:41)
  • 14:08:34Â <legoktm>Â my original plan wasn't to do joins, but to store the id => string mapping in a cache like APC since it would be mostly static once initialized (robla, 21:09:24)
  • ar_rev_id is not fully populated on enwiki. we can assign fresh revision ids though (and bump rev_id accordingly) (DanielK_WMDE_, 21:15:31)
  • <DanielK_WMDE_> we will need to construct legacy rows eventually, when we move the blob address into the content table. (brion, 21:19:19)
  • 14:16:31Â <jynus>Â the main issues happen when dataset doesn't fit into memory, which is exactly what I blocked (as the initial rolling in was going to do) (robla, 21:19:28)
  • Discussion of DanielK_WMDE_'s question: "can I assume that there is still consensus on representing content model and format as integers, and have a mappoing in the db and in memory?" (robla, 21:23:20)
  • <brion>Â jynus: I always hear enums are cheap to change. Lies? :) <jynus>Â they are cheap to add [...]Â but if you want to delete, it would be one of our most complex changes (robla, 21:26:14)
  • re managing ids for content models etc: on a cache miss, check the db. if the db doesn't have it, add it. (DanielK_WMDE_, 21:26:26)
  • tentative agreement to content model and format as int; we rule out option (c) then (DanielK_WMDE_, 21:28:38)
  • <jynus>Â I think a) -> b) is easy to do, why do we want to do b directly (genune question) (robla, 21:30:18)
  • <DanielK_WMDE_>Â we have not discussed whether the new table should just have the minimum fields for now, or the full set needed for MCR <jynus>Â adding new columns on a small table with low traffic is easy (robla, 21:53:42)
  • 14:58:08Â <TimStarling>Â can I just repeat that I am putting my 2c in for MCR fields in the initial content table, with slot=1 always (robla, 21:59:12)

Full log: P3846

This was discussed yesterday in E261. There was agreement for me to continue work on introducing the content table as proposed here, but no final commitment to this approach. We did not discuss which fields should be included in the initial version of the content table, though @tstarling thoughtfully suggested to include the fields needed for MCR to keep me motivated. It's still unclear how costly it is to add the relevant fields (cont_hash, etc) later. @Legoktm agreed to help with the implementation, at least by doing code review.

@jcrespo asked me to provide a detailed migration plan, and perhaps some php code for the migration script. These will be my next steps, then. I also plan to work on implementing a generic mapping mechanism for names, that can be used for the names of content models, formats, and roles.

Thanks for the summary, Daniel! I've added this summary to T105652 to help us keep track of the conversation, and I've made this task (T142980) into a child task of T105652.

I have some more questions which I'll probably post over on
https://www.mediawiki.org/wiki/Talk:Requests_for_comment/Content_model_storage

I think we should go for the more complex option (obviously it requires as any patch a review process, and I asked for a plan, but unused columns should be NULL to save space (simplifying, they take more or less just one bit in that case).

Krinkle subscribed.

(Marking as approved.)

Note that this RFC did not bypass the Last Call period. Rather, it was approved as part of T105652.