With the implementation of MCR progressing, questions have arisen regarding the desired behavior of link tracking with respect to the content of slots other than the main slot. Link tracking here mainly refers to the information maintained by LinksUpdate, which covers tables like pagelinks, templatelinks, imagelinks, and externallinks as well as page_props. The question, however, extends to all information maintained by DataUpdate objects returned by Content::getSecondaryDataUpdates.
Status quo
Such link tracking mainly serves two purposes:
- Detect when pages need to be re-rendered because content they depend on changes (e.g. templates).
- Find instances of content or references for removal (e.g. an image that was deleted, external links that have been found to be spam).
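Both purposes rely on the same kind of reverse lookup: given a target (a template, an image, an external URL), find the pages that reference it. A minimal sketch of that idea follows. It is written in Python for brevity and is not MediaWiki code; `LinkIndex`, `update`, and `pages_using` are hypothetical names, and the in-memory dictionaries merely stand in for the links tables.

```python
from collections import defaultdict


class LinkIndex:
    """Hypothetical in-memory stand-in for a links table such as templatelinks."""

    def __init__(self):
        self._targets_by_page = defaultdict(set)
        self._pages_by_target = defaultdict(set)

    def update(self, page, targets):
        """Replace the set of targets recorded for a page (as after an edit)."""
        new_targets = set(targets)
        # Drop reverse entries that belonged to the previous version of the page.
        for old in self._targets_by_page.pop(page, set()):
            self._pages_by_target[old].discard(page)
        self._targets_by_page[page] = new_targets
        for target in new_targets:
            self._pages_by_target[target].add(page)

    def pages_using(self, target):
        """Reverse lookup: pages to purge, or usages to show to an administrator."""
        return sorted(self._pages_by_target.get(target, ()))


index = LinkIndex()
index.update("Page A", {"Template:Infobox", "File:X.png"})
index.update("Page B", {"Template:Infobox"})
users_before = index.pages_using("Template:Infobox")

# An edit removes the template from Page A; the reverse lookup follows.
index.update("Page A", {"File:X.png"})
users_after = index.pages_using("Template:Infobox")
```

The same lookup answers "which pages must be re-rendered when this template changes" and "where is this image still used", which is why the two purposes are currently coupled.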
Within MediaWiki PHP, the tracking information that is maintained via LinksUpdate is represented in the form of ParserOutput objects. These are consumed by the skin (e.g. to display categories or language links), and by edit filters (e.g. AbuseFilter rules).
Code associations for storage:
- Without MCR: Title relates to Page, relates to the (current) Revision, which has (1) Content. (Via 1 row in revision by rev_id, with rev_text_id pointing to the item in text storage.)
- With MCR: Revision relates to (one or more) Content. (Via multiple rows in slots, matched to the revision's rev_id by slot_revision_id, where each row has 1 slot_content_id pointing to a content row, with 1 content_address pointing to an item in text storage.)
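The MCR association can be illustrated with a small sketch. It is Python rather than MediaWiki PHP, the row classes only carry the columns named above plus a role label, and the role names and address strings are made-up sample values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotRow:
    """Hypothetical stand-in for one row of the slots table."""
    slot_revision_id: int  # the rev_id this slot belongs to
    slot_role: str         # role label; representation is an assumption
    slot_content_id: int


@dataclass(frozen=True)
class ContentRow:
    """Hypothetical stand-in for one content row."""
    content_id: int
    content_address: str   # pointer to an item in text storage


def content_addresses(rev_id, slots, content):
    """Return {role: content_address} for every slot of one revision."""
    by_id = {row.content_id: row for row in content}
    return {
        slot.slot_role: by_id[slot.slot_content_id].content_address
        for slot in slots
        if slot.slot_revision_id == rev_id
    }


# Sample data: revision 7 has two slots, revision 8 has only a main slot.
slots = [
    SlotRow(7, "main", 100),
    SlotRow(7, "mediainfo", 101),
    SlotRow(8, "main", 102),
]
content = [
    ContentRow(100, "address-1"),
    ContentRow(101, "address-2"),
    ContentRow(102, "address-3"),
]
rev7 = content_addresses(7, slots, content)
rev8 = content_addresses(8, slots, content)
```

Without MCR, the lookup for a revision always yields exactly one entry; with MCR it may yield several, one per slot.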
Code associations for run-time access (currently, without MCR):
- WikiPage provides (1) ParserOutput (WikiPage::getParserOutput / PoolWorkArticleView::doWork).
- WikiPage internally gets ParserOutput by using the page's Revision to get (1) Content object.
- Then Content::getParserOutput invokes Parser with the raw text of the Content object (TextContent::fillParserOutput).
The subject of this RFC is how this will work when a revision has multiple Content objects associated (via slots).
Side notes
The Services team is currently investigating new infrastructure for tracking the dependencies between generated artefacts and editable content in a more fine-grained way. That would allow us to de-couple the tracking mechanism for purging from the one for finding usages for administrative purposes. This option is however likely more than a year out.
Also note that at present, we have no way to track which slot uses a given resource. Adding that information to the links tables is conceptually simple, but is a lot of work for the DBAs, so it should only be done if actually needed.
Questions
- Should the default behavior (e.g. when saving an edit) be to store references in link tables from only the main Content slot, or should references from extra (MCR) Content slots also be saved to link tables? In other words, does a ContentHandler (or slot role handler) need to enable tracking, or should it work by default for extension authors and instead have a way to disable tracking?
- Pros of tracking all slots: Meets the expectations of end users, e.g. finding external links via Whatlinkshere. Setting properties is easy for extension authors: for example, an extension could expose GeoJSON as page_props, and it will "just work", regardless of which slot the content is in.
- Cons of tracking all slots: Some slots may not affect rendering. If we track all slots that means changes to references from slots not used for rendering still end up purging the rendering. Suppressing the default behavior is harder than opting in.
- If the content of an extra slot is not visible (as in: does not affect the default page view), should its links be tracked? It seems that, if we only track for purging, the answer should be "no". If we track to be able to find all uses (e.g. Whatlinkshere), then the answer should be "yes". Since we track for both reasons, what should the initial implementation of MCR do?
- Pros of tracking always: Allow all references to images, templates, pages, external links, etc to be found by end-users.
- Cons of tracking always: May purge the cached default view when things change that are not used in the default view.
If all usage is always tracked, regardless of which slots are used for rendering, then the process for creating the combined ParserOutput (and from that, a LinksUpdate) can simply iterate over each slot's separate ParserOutput.
If we instead want to require more explicit tracking, then we could use a "slot-role handler" that holds the code deciding which data from the Content object to aggregate on the main ParserOutput.
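The difference between the two options can be sketched as follows. This is an illustration in Python, not a proposed interface: `SlotRoleHandler`, its `track_links` flag, and the `track_by_default` parameter are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotRoleHandler:
    """Hypothetical handler that decides whether a slot's links are aggregated."""
    role: str
    track_links: bool


def links_to_track(slot_links, handlers, track_by_default):
    """Collect the links that would end up in the combined output.

    slot_links maps a role name to the links found in that slot. A role
    without a registered handler falls back to track_by_default, which is
    the choice the first question above is asking about.
    """
    tracked = set()
    for role, links in slot_links.items():
        handler = handlers.get(role)
        enabled = handler.track_links if handler is not None else track_by_default
        if enabled:
            tracked |= set(links)
    return tracked


slot_links = {"main": {"Foo"}, "metadata": {"Bar"}}
handlers = {"main": SlotRoleHandler("main", track_links=True)}

# Track everything unless a handler opts out.
opt_out_result = links_to_track(slot_links, handlers, track_by_default=True)
# Track nothing unless a handler opts in.
opt_in_result = links_to_track(slot_links, handlers, track_by_default=False)
```

With tracking on by default, the extra slot's link to "Bar" is recorded without any work by the extension author; with opt-in tracking it is dropped until a handler for that role enables it.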
Beyond that, a "page type handler" could be used to control action handling, e.g. provide a way to hook into the purge action, and it could vary behavior by page type (file page, article, template, etc), but that is for another RFC.
Proposal
Based on the discussion on 28 March (summary at T190063#4091409):
When running links updates (after an edit, etc.):
- for each slot, construct a dedicated ParserOutput, and also a ParserOutput for the combined output.
- merge the link tracking information from all slots' ParserOutput objects into the combined ParserOutput.
- run LinksUpdate with the combined output.
- run all other DataUpdates queued by Content objects with only their own Content/ParserOutput.
This means LinksUpdate will see all aggregated information like before, but newly introduced DataUpdates from non-main slots see only their own, unless they explicitly access or transclude other slots.
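The steps above can be sketched roughly as follows. The sketch is in Python and does not use MediaWiki's actual classes: `SlotOutput` is a hypothetical stand-in for the link-tracking part of a ParserOutput, `merge_outputs` and `run_updates` are illustrative names, and the rule that a later slot wins on a page_props key collision is an assumption, since the proposal does not specify it.

```python
from dataclasses import dataclass, field


@dataclass
class SlotOutput:
    """Hypothetical stand-in for the tracking data of one ParserOutput."""
    links: set = field(default_factory=set)
    templates: set = field(default_factory=set)
    images: set = field(default_factory=set)
    external_links: set = field(default_factory=set)
    page_props: dict = field(default_factory=dict)


def merge_outputs(outputs_by_role):
    """Merge the tracking data of all slots into one combined output."""
    combined = SlotOutput()
    for out in outputs_by_role.values():
        combined.links |= out.links
        combined.templates |= out.templates
        combined.images |= out.images
        combined.external_links |= out.external_links
        # Assumption: on a key collision the later slot overwrites the earlier.
        combined.page_props.update(out.page_props)
    return combined


def run_updates(outputs_by_role, links_update, slot_updates):
    """Run the links update on the combined output, other updates per slot."""
    combined = merge_outputs(outputs_by_role)
    links_update(combined)
    for role, update in slot_updates:
        # Each remaining update sees only the output of its own slot.
        update(outputs_by_role[role])
    return combined


outputs = {
    "main": SlotOutput(links={"Foo"}, templates={"Template:Infobox"}),
    "extra": SlotOutput(links={"Bar"}, page_props={"geo": "1"}),
}
seen = []
combined = run_updates(
    outputs,
    links_update=lambda out: seen.append(("links", sorted(out.links))),
    slot_updates=[("extra", lambda out: seen.append(("extra", sorted(out.links))))],
)
```

In this example the links update receives both "Foo" and "Bar", while the update queued for the extra slot receives only "Bar".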
Rationale: This approach preserves the maximum of information, and is easy to implement. The fact that it may lead to extraneous data tracking and spurious purging of the parser cache does not seem relevant in the light of the currently targeted use cases. This issue should be revisited in the context of the creation of an entirely new mechanism for tracking dependencies of generated artifacts for purging.
Relevant code experiment:
- https://gerrit.wikimedia.org/r/c/421794/6/includes/Render/RevisionRenderer.php#315 and below
- https://gerrit.wikimedia.org/r/c/405015/47/includes/Storage/PageMetaDataUpdater.php#1136
Further reading:
- Use cases https://www.mediawiki.org/wiki/Requests_for_comment/Multi-Content_Revisions#Use_Cases
- On-wiki discussion https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/MCR-PO
This RFC is intended to resolve questions about the expected behavior of tracking meta-data in links tables (pagelinks, imagelinks, templatelinks, etc), to guide the architecture and initial implementation of MCR related code. This RFC is not intended to gain approval for a technical solution, but for the requirements for such a solution.
Note that the mechanism for combining the HTML of multiple slots is beyond the scope of this RFC. The obvious approach is to let each slot decide how it presents itself in the standard "article" view. This allows slots to be freely combined. However, some central control of the layout may be desirable for well-known combinations of slots, e.g. for the integration of MediaInfo on file description pages for the DSC project.