
RfC: Content model storage
Closed, ResolvedPublic

Description

This RFC is a proposal to change how we store content model and format in the page, revision, and archive tables. This has mainly come up as we are changing namespaces from a default content model of "wikitext" to "flow-board".

This was approved in a meeting in 2015, but we didn't have an implementation plan, so in August 2016, we're revisiting it. The choices in front of us:

a. Implement this RFC as originally approved
b. Implement a modified proposal fully described by T142980, with one extra table but no new columns, to also cater to the needs of multi content revisions (T107595).
c. Start over (don't pick this one!)

Further details: mw:Requests_for_comment/Content_model_storage

Note: the author of this RFC prefers comments on mw:Talk:Requests_for_comment/Content_model_storage. Comments in this task will still be helpful to document the chronology of this RFC and the consensus around it.


See also: T107595: [RFC] Multi-Content Revisions

Event Timeline

Legoktm raised the priority of this task from to Medium.
Legoktm updated the task description. (Show Details)
Legoktm added a subscriber: Legoktm.

15:01 < legoktm> I guess the action is that I'm going to implement the RfC? :)
<jynus> with help, legoktm

@Legoktm: Sorry I missed the RFC meeting. Before implementing this, please note that I plan to propose a change to the revision table that would split all the Content related information into a separate table, to allow multiple content "slots" for each revision. I'm sorry I have been slow with writing that RFC... I wanted to make some code experiments first, in order to get a better feel of how to go about this. However, I already have a pretty good idea of what this should look like on the database level.

I'll sit down now and write a basic RFC, and make it a blocker of this one. The necessary schema changes should be considered and decided together. The implementations of these RFCs don't have to block each other, but they should be coordinated.

Added T107595: [RFC] Multi-Content Revisions as a blocker; Multi-Content Revisions don't need to be implemented in order to go ahead with changing the content model storage, but we should at least have a decision on that first. The schema changes implied by each of the RFCs may otherwise interfere with each other, and cause unnecessary overhead.

> I'll sit down now and write a basic RFC, and make it a blocker of this one. The necessary schema changes should be considered and decided together. The implementations of these RFCs don't have to block each other, but they should be coordinated.

This was noted at the RfC meeting and we agreed to continue with this one and *not* block on your RfC. Did you read the logs of the meeting?

@Legoktm I skimmed it, and have re-read it now. I agree that the implementation of this shouldn't be blocked. I just want to make sure that we are all on the same page regarding where we are headed with representing revisions in the database, and that we minimize schema changes to large tables. So with "blocker" I really mean "closely related". There's no good way to say that in phabricator, is there?

After the RFC meeting, I was also left wondering if we had considered enabling MySQL compression on the table. @Springle, do you think compression might improve performance overall?

> and that we minimize schema changes to large tables.

This change requires two schema changes, one to add the new columns, and a second to drop the old ones. In the meeting we discussed that we could postpone the second schema change until your (or any other) schema change needed to happen on those tables.
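To make the two-step idea concrete, here is a minimal sketch using Python's built-in sqlite3 module instead of MediaWiki's real MySQL schema. The `content_models` table and the `cm_id`, `cm_name`, and `rev_cm_id` names are hypothetical placeholders chosen for illustration, and the `revision` table is cut down to two columns. It shows only the first schema change plus a backfill; the second change (dropping the old column) is deliberately left out, since that is the step that could be postponed.

```python
import sqlite3

# Simplified stand-in for the real revision table (assumption: only the
# columns needed for this illustration are present).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_content_model TEXT          -- old string column
    );
    INSERT INTO revision (rev_id, rev_content_model) VALUES
        (1, 'wikitext'), (2, 'flow-board'), (3, 'wikitext');
""")

# Schema change 1: add a lookup table and a new integer column.
# All names introduced here are hypothetical.
conn.executescript("""
    CREATE TABLE content_models (
        cm_id INTEGER PRIMARY KEY,
        cm_name TEXT NOT NULL UNIQUE
    );
    ALTER TABLE revision ADD COLUMN rev_cm_id INTEGER;
""")

# Backfill: register each distinct model name once, then point every
# revision at the matching integer ID.
conn.execute("""
    INSERT INTO content_models (cm_name)
    SELECT DISTINCT rev_content_model FROM revision
    WHERE rev_content_model IS NOT NULL
""")
conn.execute("""
    UPDATE revision
    SET rev_cm_id = (SELECT cm_id FROM content_models
                     WHERE cm_name = rev_content_model)
""")
conn.commit()

# Schema change 2 (dropping rev_content_model) is not performed here; until
# it happens, both the old and the new column stay readable.
rows = conn.execute("""
    SELECT rev_id, cm_name
    FROM revision JOIN content_models ON cm_id = rev_cm_id
    ORDER BY rev_id
""").fetchall()
```

After the backfill, `rows` holds the model name for each revision as resolved through the new integer column, so readers can switch to the new column before the old one is removed.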

> So with "blocker" I really mean "closely related". There's no good way to say that in phabricator, is there?

I'll add a "See also" in the task description.

Legoktm set Security to None.

@Legoktm with my proposal, you'd drop the newly added column again. Does that make sense?

I propose to introduce a separate table to hold meta-information about revision content (e.g. model and format, but also hash, length, etc). That separate table can be joined against page_current, rev_id, or ar_rev_id. I have described my proposal in more detail on https://www.mediawiki.org/wiki/Talk:Requests_for_comment/Content_model_storage (flow post).
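As a rough illustration of the separate-table idea, the sketch below uses Python's sqlite3 module with a heavily simplified schema. The `content` table layout and its `cont_*` column names are assumptions made for this example (they are not taken from the linked proposal), and only the join against `rev_id` is shown.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified revision table (assumption: most real columns omitted).
    CREATE TABLE revision (
        rev_id INTEGER PRIMARY KEY,
        rev_page INTEGER NOT NULL
    );
    -- Hypothetical side table holding content meta-information.
    CREATE TABLE content (
        cont_revision INTEGER NOT NULL REFERENCES revision(rev_id),
        cont_model TEXT NOT NULL,
        cont_format TEXT NOT NULL,
        cont_hash TEXT NOT NULL,
        cont_length INTEGER NOT NULL
    );
""")


def add_revision(rev_id, page_id, text, model, fmt):
    """Store a revision row plus its content meta-data row."""
    data = text.encode("utf-8")
    conn.execute("INSERT INTO revision VALUES (?, ?)", (rev_id, page_id))
    conn.execute(
        "INSERT INTO content VALUES (?, ?, ?, ?, ?)",
        (rev_id, model, fmt, hashlib.sha1(data).hexdigest(), len(data)),
    )


def content_meta(rev_id):
    """Return (model, format, length) for a revision, or None if unknown."""
    return conn.execute(
        """
        SELECT cont_model, cont_format, cont_length
        FROM revision JOIN content ON cont_revision = rev_id
        WHERE rev_id = ?
        """,
        (rev_id,),
    ).fetchone()


add_revision(10, 1, "Hello ''world''", "wikitext", "text/x-wiki")
add_revision(11, 1, '{"a": 1}', "json", "application/json")
```

With this layout the revision table itself no longer needs model or format columns; anything that wants them joins against the side table, for example via `content_meta(10)`.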

For the record, I do not consider this blocked on T107595: [RFC] Multi-Content Revisions. Rather the other way around: this can be one step towards MCR support, especially with a separate table for content meta-data, see above.

In any case, the two efforts should be aware of each other.

daniel edited projects, added TechCom-RFC; removed TechCom-RFC (TechCom-RFC-Closed).

Putting this back into the ArchCom inbox, to push it a bit.

I'm claiming this as a shepherd in the context of a renewed RFC process to get this unstuck.

Thank you, daniel!

I am happy and interested to apply this change to the db (that would potentially solve ongoing issues some people complained about on the mailing list + save a lot of memory and disk on databases). However, I think this needs a development champion.

Maybe we can tempt @Legoktm into a normalization process :-)

I might help with the implementation. https://gerrit.wikimedia.org/r/#/c/302492/ is already a good start, I think.

RobLa-WMF added subscribers: Anomie, RobLa-WMF.

This was discussed at length at the 2016W33 ArchCom Office Hour (Minutes: E261, Log: P3846). We discussed three options:

a. Implement this RFC as originally approved
b. Implement a modified proposal fully described by T142980, with one extra table but no new columns, to also cater to the needs of multi content revisions (T107595).
c. None of the above (stay with status quo)

My reading of the meeting is that there was tentative consensus around option "b". I think we've settled that it's worthwhile for @daniel, @Legoktm and everyone else to keep working on a proposed schema change and some code/etc. Toward the end of the meeting, @tstarling asked that we add the fields needed for T107595 (MCR) in the initial content table "with slot=1 always". @jcrespo ("jynus" on IRC) participated heavily in the conversation, saying near the end that he is looking to understand the general mood, and that "db is not a blocker here". The conversation did continue for a little bit after the meeting (see #wikimedia-tech log: 20160817.txt) where Daniel pointed Jaime to gerrit #302056.

One thing that can potentially stall this indefinitely is the recursive dependency problem (the "hen-and-egg thing" as @daniel suggested in E261), but it seems everyone is motivated to power through this, and the blockers seem theoretical rather than like concrete opposition. As of this writing, @Anomie and @daniel are discussing the schema issues in gerrit #302056.

Daniel provided a summary of the 2016W33 ArchCom office hour discussion (E261) in T142980, which I'll quote here to supplement my earlier comment on this task:

[T142980 and T105652 were] discussed yesterday in E261. There was agreement for me to continue work on introducing the content table as proposed here, but no final commitment to this approach. We did not discuss which fields should be included in the initial version of the content table, though @tstarling thoughtfully suggested to include the fields needed for MCR to keep me motivated. It's still unclear how costly it is to add the relevant fields (cont_hash, etc) later. @Legoktm agreed to help with the implementation, at least by doing code review.

@jcrespo asked me to provide a detailed migration plan, and perhaps some PHP code for the migration script. These will be my next steps, then. I also plan to work on implementing a generic mapping mechanism for names that can be used for the names of content models, formats, and roles.
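One way such a generic name mapping could look is sketched below in Python with sqlite3. This is not the planned implementation: the `NameTable` class, its method names, and the three table names are all hypothetical, chosen only to show a single helper serving content models, formats, and roles alike.

```python
import sqlite3


class NameTable:
    """Hypothetical helper mapping string names to small integer IDs.

    IDs are created on first use and cached in memory afterwards.
    """

    def __init__(self, conn, table):
        # The table name is interpolated into SQL, so restrict it to a
        # plain identifier.
        if not table.isidentifier():
            raise ValueError(f"invalid table name: {table!r}")
        self.conn = conn
        self.table = table
        self._ids = {}
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
        )

    def acquire_id(self, name):
        """Return the ID for name, inserting a new row if needed."""
        if name not in self._ids:
            self.conn.execute(
                f"INSERT OR IGNORE INTO {self.table} (name) VALUES (?)",
                (name,),
            )
            row = self.conn.execute(
                f"SELECT id FROM {self.table} WHERE name = ?", (name,)
            ).fetchone()
            self._ids[name] = row[0]
        return self._ids[name]

    def get_name(self, id_):
        """Return the name stored for id_, or raise KeyError."""
        row = self.conn.execute(
            f"SELECT name FROM {self.table} WHERE id = ?", (id_,)
        ).fetchone()
        if row is None:
            raise KeyError(id_)
        return row[0]


conn = sqlite3.connect(":memory:")
# The same class serves all three kinds of names (table names assumed).
models = NameTable(conn, "content_models")
formats = NameTable(conn, "content_formats")
roles = NameTable(conn, "content_roles")

wikitext_id = models.acquire_id("wikitext")
```

Callers would store the integer from `acquire_id` in the large tables and call `get_name` only when the string is actually needed.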

I've made T142980 into a child task of T105652. The relationship:

  • T105652 - Parent RFC. Approved but unimplemented RFC which documents schema changes desired for more robust handling of non-Wikitext data structures as "pages" and "revisions"
    • T142980 - Child RFC. Proposal to implement a "content" table to solve the problems described in T105652. I believe that (as of this writing) the T142980 proposal would allow us (and perhaps force us) to store "content" metadata for every revision in the "revision" table

Daniel is interested in the content table (T142980) to aid migration to MCR (T107595)

I'm moving this to the RFC inbox, to be re-considered for implementation separate from T142980: RFC: Create a content meta-data table. Normalizing content model storage at the same time as re-factoring the schema for revision content meta-data still seems like a good idea in general, but since T142980 seems to be blocked for the moment, it may make sense to do this separately after all.

Unassigning myself, since we now track shepherds via TechCom-Has-shepherd.

The last call on this RFC, as posted to wikitech-l on 2016-12-14, has passed without any new concerns being raised.

The RFC is thus approved as proposed, and implementation can go forward.

Implementation work and deployment schedule should be coordinated with the imminent work on Multi-Content-Revisions. If you are going to work on Content Model Storage, please talk to me first.

daniel claimed this task.

Approved, closing