Background
Hi, we've been asked by Site Reliability to circle back through the TechCom process in case you can help us find a more palatable way to store our data. The problem from an Ops perspective is that we're increasing the row count in the page and revision tables, which are obviously critical infrastructure and are already at a breaking point. As I understand it, the page and revision tables can't be sharded any more granularly than per-wiki, and they must be replicated to every DB node, so they scale poorly. Another concern is that the pages can't be deleted if our project fails or we decide to disable the on-wiki storage component.
The arguments in favor of wiki page storage revolve around how well wiki pages satisfy our requirements for collaboration, suppression, and visibility.
Here's some context for this project:
https://www.mediawiki.org/wiki/Jade
The currently proposed technical implementation and its code:
https://www.mediawiki.org/wiki/Extension:Jade
Exploration of alternative implementations:
https://www.mediawiki.org/wiki/Jade/Implementations
Exploration of alternative implementations (in spreadsheet form):
https://docs.google.com/spreadsheets/d/1y7CPeAFpjOO-FTXLhp9qfO3lx6-OsaroCMNSNJMUFqc/edit#gid=0
Anticipated use cases (pending user tests):
https://docs.google.com/spreadsheets/d/1RPb8VHbseE_xPe46nFqo4QVYmwzgfFJrO4_Wh-QKBSw/edit#gid=0
Older discussion about using wiki page storage for judgments:
https://etherpad.wikimedia.org/p/Jade_scalability_FAQ
T196547: [Epic] Extension:JADE scalability concerns
Proposal
The proposal is to create two new namespaces, Judgment and Judgment_talk (exact names to be decided at T200365). In the content namespace, each page will be a JSON description of judgments about a wiki entity, such as a particular edit. For example, w:en:Judgment:Diff/123457 would hold judgments about whether https://en.wikipedia.org/wiki/?diff=123457 is a damaging or good-faith edit, and its talk page would hold the dialogue leading to this consensus.
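To make the title scheme concrete, here is a minimal Python sketch of how a judgment page title could be mapped back to the entity it judges. It is not code from Extension:Jade. The namespace name is the tentative one from this proposal, the entity types are only the two that appear in the examples in this task, and the helper names (`parse_judgment_title`, `diff_url`) are hypothetical.

```python
import re
from typing import NamedTuple

# Tentative namespace name from this proposal; the final name is
# to be decided at T200365.
JUDGMENT_NAMESPACE = "Judgment"

# Assumption: only the entity types that appear in this task's examples.
KNOWN_ENTITY_TYPES = {"Diff", "Revision"}

# Titles look like "<Namespace>:<EntityType>/<numeric id>".
_TITLE_RE = re.compile(r"^(?P<ns>[^:]+):(?P<type>[^/]+)/(?P<id>\d+)$")


class JudgmentTarget(NamedTuple):
    entity_type: str
    entity_id: int


def parse_judgment_title(title: str) -> JudgmentTarget:
    """Split a judgment page title into the judged entity's type and ID."""
    match = _TITLE_RE.match(title)
    if match is None or match["ns"] != JUDGMENT_NAMESPACE:
        raise ValueError(f"not a judgment title: {title!r}")
    if match["type"] not in KNOWN_ENTITY_TYPES:
        raise ValueError(f"unknown entity type in title: {title!r}")
    return JudgmentTarget(match["type"], int(match["id"]))


def diff_url(target: JudgmentTarget, host: str = "en.wikipedia.org") -> str:
    """Build the URL of the diff that a Diff judgment refers to."""
    if target.entity_type != "Diff":
        raise ValueError("only Diff targets have a diff URL")
    return f"https://{host}/wiki/?diff={target.entity_id}"
```

With these helpers, `parse_judgment_title("Judgment:Diff/123457")` yields a target with type `Diff` and ID `123457`, and `diff_url` of that target gives the diff link used in the example above.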
Integration
Our first integrations will be to transparently duplicate data from existing user workflows. (UPDATE: We have a new guideline for the project, which is to only integrate in ways that allow for collaboration. That way, our data will have more consistent quality and won't just be a mirror of the simpler, existing processes. Jade data should be produced and reviewed collaboratively.)
The first planned integration is T201361: Jade Implementation: Watchlist integration, which will expose Jade edits and summary information about the judgment in watchlists which track the page being judged.
If the watchlist integration goes well, similar principles can be used to embed Jade in other revision pagers (Special:RecentChanges, Special:Contributions, action=history).
Additional workflows can be enriched by Jade integration. For example, we can collect comments during patrolling actions, which has been shown to increase operator accuracy in other domains. We can also expose existing Jade judgments in patrolling interfaces and allow for collaborative interaction with the judgment content. This is quite vague for now and will wait until after the initial integration cycles.
Each workflow integration will be enabled or rolled back by a separate wiki configuration.
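As an illustration of what per-integration switches could look like, here is a small Python sketch. It is not the extension's actual configuration: the flag names, the `WIKI_FLAGS` table, and the `testwiki` entry are all hypothetical, chosen only to mirror the integrations named above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JadeIntegrationFlags:
    """Hypothetical per-wiki switches, one per workflow integration."""
    watchlist: bool = False
    recent_changes: bool = False
    contributions: bool = False
    history: bool = False


# Hypothetical per-wiki settings; wikis not listed get every integration off.
WIKI_FLAGS = {
    "testwiki": JadeIntegrationFlags(watchlist=True),
}


def is_enabled(wiki: str, integration: str) -> bool:
    """Return whether one integration is switched on for one wiki."""
    flags = WIKI_FLAGS.get(wiki, JadeIntegrationFlags())
    return getattr(flags, integration)
```

Keeping one flag per integration means a problematic integration can be rolled back on a single wiki without touching the others.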
Impact
Our estimated impact is to ramp up to a 1% increase in the number of revisions created on each wiki, with a page also created for each of these judgments. The integration will be done incrementally, so this increase doesn't have to happen all at once, but can be stretched out over months or years. We insist that our namespace is only appropriate for human judgments and not bot predictions, so human labor time should be the limiting factor for how much review is performed and how many pages are created. This human labor assumption is the basis for our 1% overhead, and it comes from the total proportion of current wiki edits which are reviewed across all review workflows. Since bots could blow through this ceiling in dangerous ways, we're asking for a social agreement to curb bot abuse in the new namespaces before enabling Jade on any wiki.
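The arithmetic behind this estimate can be sketched as follows. The function name and the `ramp_fraction` parameter are hypothetical, and the input figure in the usage note is a round illustrative number, not a measurement from any wiki.

```python
def estimate_jade_overhead(
    revisions_per_month: int,
    overhead_ratio: float = 0.01,  # the 1% ceiling from this proposal
    ramp_fraction: float = 1.0,    # hypothetical: share of the ceiling reached so far
) -> tuple[int, int]:
    """Return (extra revisions, extra pages) per month under the 1% assumption."""
    extra_revisions = round(revisions_per_month * overhead_ratio * ramp_fraction)
    # The proposal expects a page to be created for each of these judgments,
    # so at the ceiling the page count grows by the same amount.
    extra_pages = extra_revisions
    return extra_revisions, extra_pages
```

For a hypothetical wiki with 1,000,000 revisions per month, the full 1% ceiling corresponds to 10,000 extra revisions and 10,000 extra pages per month. At a quarter of the way through the ramp-up (`ramp_fraction=0.25`), it would be 2,500 of each.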
Future
In the very long term, we anticipate that structured content models will have dedicated storage support, which will be a natural fit for Jade, allowing us to shard more appropriately and to run analysis queries against the JSON content. Ideally, that migration will reclaim all storage from MariaDB.
Alternatives
See this document for alternative implementations:
https://www.mediawiki.org/wiki/Jade/Implementations
Update, 2018-11-19
We've implemented some of the pilot features for Extension:Jade. See the following resources:
Beta cluster sandbox judgments:
- https://en.wikipedia.beta.wmflabs.org/wiki/Judgment:Diff/376901
- https://en.wikipedia.beta.wmflabs.org/wiki/Judgment:Revision/376901
Browse the source code: