
Use the change_tag table to store the proofreading quality level
Open, Needs Triage, Public, Feature

Description

ProofreadPage stores a "proofreading quality level" alongside each Page: page revision. This quality level is currently stored inside the wikitext content of each revision using the <pagequality> tag. Ensuring that the <pagequality> tag is present, valid and unique is a fairly heavy task.
This approach also requires fetching the wikitext content to retrieve the proofreading quality level.
To work around that, the proofreading quality level of the current revision is also stored in the page_props database table.
ProofreadPage never displays the <pagequality> tag to editors but hides it behind buttons displayed near the edit summary field.
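
To illustrate why reading the level from wikitext is awkward, here is a minimal sketch of the extraction step. It assumes the tag looks like `<pagequality level="3" ... />` with a level between 0 and 4; the exact attribute layout and the function name are assumptions for illustration, not ProofreadPage's actual parser.

```python
import re
from typing import Optional

# Assumed tag shape: <pagequality level="N" ... /> with N in 0..4.
PAGEQUALITY_RE = re.compile(r'<pagequality\s+level="([0-4])"[^>]*/>')


def extract_quality_level(wikitext: str) -> Optional[int]:
    """Return the level only if exactly one valid tag is present, else None."""
    matches = PAGEQUALITY_RE.findall(wikitext)
    if len(matches) != 1:
        # Missing, malformed or duplicated tag: the revision has no usable level.
        return None
    return int(matches[0])
```

Every caller has to load the full revision text first and then handle the missing, invalid and duplicate cases, which is the cost described above.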

Since the "proofreading quality level" was introduced around 2008, the change tags system has been added to MediaWiki. It might be relevant to migrate ProofreadPage to it. This way the content of Page: pages will be properly separated from metadata, and the "proofreading quality level" of each revision will be easily accessible from the database.

There are two options for implementing the "proofreading quality level" storage in the change tags system.

  1. Tag each revision with one of the 5 possible proofreading quality level change tags.
  2. Tag only the revisions that change the quality level, using the change tag of the new level.

I prefer option 2 because it makes it quick to find which revisions have changed the quality level. We already store the quality level of the current revision, which is the most used data, in the page_props table. It also decreases the noise when displaying tags.
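
A minimal sketch of option 2, assuming the quality level of each revision of a page is already known and listed oldest first. The function name is hypothetical, and treating the first revision as a change (so that it always gets a tag) is an assumption, not something decided in this task.

```python
from typing import List, Optional, Tuple


def revisions_to_tag(levels: List[int]) -> List[Tuple[int, int]]:
    """Return (revision_index, new_level) for each revision that changes the level."""
    tagged: List[Tuple[int, int]] = []
    previous: Optional[int] = None
    for index, level in enumerate(levels):
        if level != previous:
            # Only a change of level produces a tag; plain text edits do not.
            tagged.append((index, level))
        previous = level
    return tagged
```

For a page whose six revisions have the levels 1, 1, 3, 3, 3, 4, only three revisions would be tagged, where option 1 would tag all six.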

Volumetry: The biggest Wikisource (fr) currently stores ~3M pages. Assuming that the average number of proofreading quality level changes per page is at most 5, this will add no more than about 15M tags to the biggest Wikisources.
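
The estimate is a simple product. The figures below are the rough ones quoted above, not measurements:

```python
pages = 3_000_000          # approximate number of pages on the biggest Wikisource
max_changes_per_page = 5   # assumed upper bound on the average number of level changes

upper_bound_tags = pages * max_changes_per_page
print(f"{upper_bound_tags:,} tags at most")
```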

Pros:

  1. Allows accessing the proofreading quality level directly from the database.
  2. Separates the content from metadata.
  3. Simplifies ProofreadPage internals.
  4. Might allow treating the content of Page: pages as plain wikitext in the future, which would fix a lot of things, e.g. the VisualEditor.

Cons:

  1. Migration cost.
  2. The old revisions will still contain the "<pagequality>" tag in their wikitext so artefacts of the previous system will remain.

Event Timeline

@Tgr @Legoktm Sorry for bothering you. I believe you were at some point involved with the change tags system. Does our plan seem sensible to you?

Seems more like a use case for an MCR slot.

(Although I haven't really been involved with change tags; @Ladsgroup might be a better person to ask about them.)

One advantage of change tags is that it allows users to filter watchlists, logs and other similar lists by the relevant status tags, using the existing tag filtering mechanisms. So then you can see all the "proofread" or "validation" actions in a certain list.

16M is not much for change_tag, Wikidata has around a billion. The problem is that change_tag is for revision, not page so it means if you want to have proper watchlist integration you need to add it on all edits and not just the one that changes the proofread status. I see two options forward:

  • MCR
  • page_props (+ change tag when it changes I guess).

Thank you @Tgr and @Ladsgroup for your feedback!

The "page_props (+ change tag when it changes)" option is what we were considering (the page_props part has already been implemented for a few years, alongside automated categories).
The MCR solution does not meet our goal, which is to expose this data in the database. But it is indeed a good plan for moving the data out of the main slot if the revision table is not considered "safe enough" for the use case (even though using a slot for one byte of data seems quite heavy to me).

The problem is that change_tag is for revision, not page so it means if you want to have proper watchlist integration you need to add it on all edits and not just the one that changes the proofread status.

That's a great point and a very valid use case. But it seemed to us that tagging all edits would make common use cases like "look up new validations" quite heavy and would make the history a bit noisy. And, for the latest revision, there is already the page property. If there is community demand for "find edits to validated pages", we might add a tag for that (the other quality levels are more or less "page in progress", so patrolling them is likely to be less careful). If we change our mind afterwards, "filling the gap" by adding the quality level tag to all intermediate revisions is just a DB migration script away. So tagging only the revisions that change the level does not prevent a later change if our assumptions are wrong.
What do you think about it?
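
As a rough illustration of the "filling the gap" step, the sketch below carries the last known level forward across the untagged revisions of a single page. The function name and the in-memory data structures are hypothetical; a real migration script would read and write the database instead.

```python
from typing import Dict, List, Optional


def backfill_levels(
    revision_ids: List[int], change_tags: Dict[int, int]
) -> Dict[int, Optional[int]]:
    """Map every revision (oldest first) to the level in effect at that revision."""
    filled: Dict[int, Optional[int]] = {}
    current: Optional[int] = None
    for rev_id in revision_ids:
        if rev_id in change_tags:
            # This revision changed the level, so it becomes the current one.
            current = change_tags[rev_id]
        # Untagged revisions inherit the most recent level (None before any tag).
        filled[rev_id] = current
    return filled
```

With revisions 10 to 13 and tags only on 10 (level 1) and 12 (level 3), revision 11 would receive level 1 and revision 13 level 3.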

Thank you again!

From what I'm seeing, change_tags would work implementation-wise, but conceptually MCR makes much more sense. We can have a general slot for that and then add more info to the slot later if you have plans for it.

@Ladsgroup Thank you! So what would you think about MCR for storing the "main" data and some change tags as "secondary" data for queryability and recent changes filtering? This way we would get the best of both worlds.

@Tpt I also prefer the MCR route. I think that might allow better handling of the migration cost too.

Maybe a database script could be used to add the slots to the revisions based on the current pagequality extension tags in the main slots. I don't like altering past history and would not suggest attempting to alter the historic main slots containing the pagequality tags, but after migration they can effectively be ignored.

Since we are considering adding MCR slots, I also recommend we change the format and do away with the pagequality extension tag (except maybe to ignore it in past revisions and to make sure it is removed from all new revisions). The new Proofread Page slots with pagequality data could be JSON (not unlike how Wikibase handles things) unless you have a better recommendation. Then in many ways this could be handled the way Structured Data on Commons was added on (although they did not have an existing migration issue).
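
To make the JSON idea concrete, here is a small sketch of what encoding and validating such a slot could look like. The `{"level": N}` schema and both function names are purely hypothetical; no format has been agreed in this task.

```python
import json

VALID_LEVELS = range(5)  # the 5 possible quality levels, assumed to be numbered 0..4


def encode_slot(level: int) -> str:
    """Serialize a quality level into the hypothetical slot format."""
    if level not in VALID_LEVELS:
        raise ValueError(f"invalid quality level: {level!r}")
    return json.dumps({"level": level})


def decode_slot(content: str) -> int:
    """Parse slot content and return the quality level, rejecting invalid data."""
    data = json.loads(content)
    level = data.get("level") if isinstance(data, dict) else None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(level, bool) or not isinstance(level, int) or level not in VALID_LEVELS:
        raise ValueError("slot content does not hold a valid quality level")
    return level
```

Using an object rather than a bare number would leave room to add more information to the slot later, as suggested earlier in the thread.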