Maniphest T190063

Tracking dependencies for multiple Content objects per page (MCR)
Closed, Resolved (Public)

Authored By

	daniel
	Mar 19 2018, 3:50 PM

Description

With the implementation of MCR progressing, questions have arisen regarding the desired behavior of link tracking with respect to the content of slots other than the main slot. Link tracking here mainly refers to information maintained by LinksUpdate, including tables like pagelinks, templatelinks, imagelinks, and externallinks, as well as page_props. The question extends, however, to all information maintained by DataUpdate objects returned by Content::getSecondaryDataUpdates.

Status quo

Such link tracking mainly serves two purposes:

Detect when pages need to be re-rendered because content they depend on has changed (e.g. templates).
Find instances of content or references for removal (e.g. an image that was deleted, external links that have been found to be spam).

Within MediaWiki PHP, the tracking information that is maintained via LinksUpdate is represented in the form of ParserOutput objects. These are consumed by the skin (e.g. to display categories or language links), and by edit filters (e.g. AbuseFilter rules).

Code associations for storage:

Without MCR: Title relates to Page, relates to the (current) Revision, which has (1) Content. (Via 1 row in revision by rev_id, with rev_text_id pointing to the item in text storage.)
With MCR: Revision relates to (one or more) Content. (Via multiple rows in slots selected by slot_revision_id, where each row points to 1 slot_content_id, which in turn has 1 content_address pointing to an item in text storage.)
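To make the MCR association concrete, here is a toy model of the lookup, written in Python rather than MediaWiki PHP. It is a sketch under assumptions: the class names, the use of a role name instead of a role id, and the example addresses are illustrative and not taken from the actual schema definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContentRow:
    content_id: int
    content_address: str  # pointer into text storage (value format is illustrative)


@dataclass(frozen=True)
class SlotRow:
    slot_revision_id: int
    slot_role: str  # simplification: a role name instead of a numeric role id
    slot_content_id: int


def contents_for_revision(rev_id, slots, contents):
    """Return {role: content_address} for all slots of one revision."""
    by_id = {c.content_id: c for c in contents}
    return {
        s.slot_role: by_id[s.slot_content_id].content_address
        for s in slots
        if s.slot_revision_id == rev_id
    }


# Two revisions; the second changes only the main slot.
# Assumption: an unchanged slot may point at the same content row again.
slots = [
    SlotRow(100, "main", 1),
    SlotRow(100, "mediainfo", 2),
    SlotRow(101, "main", 3),
    SlotRow(101, "mediainfo", 2),
]
contents = [ContentRow(1, "tt:11"), ContentRow(2, "tt:12"), ContentRow(3, "tt:13")]
```

With this data, looking up revision 101 yields one address per role, and the mediainfo address is the same as in revision 100.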

Code associations for run-time access (currently, without MCR):

WikiPage provides (1) ParserOutput (WikiPage::getParserOutput / PoolWorkArticleView::doWork).
WikiPage internally gets ParserOutput by using the page's Revision to get (1) Content object.
Then Content::getParserOutput invokes Parser with the raw text of the Content object (TextContent::fillParserOutput).

The subject of this RFC is how this will work when a revision has multiple Content objects associated (via slots).

Side notes

The Services team is currently investigating new infrastructure for tracking the dependencies between generated artefacts and editable content in a more fine-grained way. That would allow us to de-couple the tracking mechanism for purging from the one for finding usages for administrative purposes. This option is however likely more than a year out.

Also note that at present, we have no way to track which slot uses a given resource. Adding that information to the links tables is conceptually simple, but is a lot of work for the DBAs, so it should only be done if actually needed.

Questions

Should the default behavior (e.g. when saving an edit) be to store references in link tables from only the main Content slot, or should references from extra (MCR) Content slots also be saved to link tables? In other words, does a ContentHandler (or slot role handler) need to enable tracking, or should it work by default for extension authors and instead have a way to disable tracking?
- Pros of tracking all slots: Meets expectations of end-users, for example finding external links via Whatlinkshere. Setting properties is easy for extension authors: an extension could expose GeoJSON as page_props, and it will "just work", regardless of which slot the content is in.
- Cons of tracking all slots: Some slots may not affect rendering. If we track all slots, changes to references from slots not used for rendering still end up purging the rendering. Suppressing the default behavior is harder than opting in.
If the content of an extra slot is not visible (as in: does not affect the default page view), should its links be tracked? It seems that, if we only track for purging, the answer should be "no". If we track to be able to find all uses (e.g. Whatlinkshere), then the answer should be "yes". Since we track for both reasons, what should the initial implementation of MCR do?
- Pros of tracking always: Allow all references to images, templates, pages, external links, etc to be found by end-users.
- Cons of tracking always: May purge the cached default view when things change that are not used in the default view.

If all usage is always tracked, regardless of which slots are used by rendering, then the process for creating the combined ParserOutput (and from that, a LinksUpdate) can simply iterate over each slot's separate ParserOutput.

If we instead want to require more explicit tracking, then we could use a "slot-role handler" where code would live that decided what data from the Content object to aggregate on the main ParserOutput.
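As a rough illustration of the opt-in alternative, the following Python toy model shows a per-role handler deciding what a slot contributes to the combined tracking data. The class and method names are hypothetical and do not correspond to existing MediaWiki code; links are reduced to plain sets of strings.

```python
class SlotRoleHandler:
    """Hypothetical per-role hook. Default: contribute nothing (opt-in)."""

    def links_to_aggregate(self, slot_links):
        return set()


class TrackingHandler(SlotRoleHandler):
    """A handler that opts in by passing all of its slot's links through."""

    def links_to_aggregate(self, slot_links):
        return set(slot_links)


def aggregate(main_links, aux_slots, handlers):
    """Combine main-slot links with whatever each auxiliary handler opts in."""
    combined = set(main_links)
    default = SlotRoleHandler()
    for role, links in aux_slots.items():
        handler = handlers.get(role, default)
        combined |= handler.links_to_aggregate(links)
    return combined
```

In this model, a slot whose role has no registered handler is silently left out of the combined data, which is exactly the failure mode that makes opt-in tracking easy to forget.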

Beyond that, a "page type handler" could be used to control action handling, e.g. provide a way to hook into the purge action, and it could vary behavior by page type (file page, article, template, etc), but that is for another RFC.

Proposal

Based on the discussion on 28 March (summary at T190063#4091409):

When running links updates (after an edit, etc)

for each slot, construct a dedicated ParserOutput, and also a ParserOutput for the combined output.
merge the link tracking information from all slots' ParserOutput objects into a combined ParserOutput.
run LinksUpdate with the combined output.
run all other DataUpdates queued by Content objects with only their own Content/ParserOutput.

This means LinksUpdate will see all aggregated information like before, but newly introduced DataUpdates from non-main slots see only their own, unless they explicitly access or transclude other slots.
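The steps above can be sketched as a toy model in Python. This is not MediaWiki code: SlotOutput is a hypothetical stand-in for a per-slot ParserOutput, the link tables are plain sets, and the handling of conflicting page_props keys (a later role overwrites an earlier one) is an assumption that the proposal does not specify.

```python
from dataclasses import dataclass, field

LINK_TABLES = ("pagelinks", "templatelinks", "imagelinks", "externallinks")


@dataclass
class SlotOutput:
    """Toy stand-in for a per-slot ParserOutput (tracking data only)."""
    links: dict = field(default_factory=dict)       # table name -> set of targets
    page_props: dict = field(default_factory=dict)


def merge_tracking(per_slot):
    """Merge link tracking info from all slots into one combined output."""
    combined = SlotOutput(links={t: set() for t in LINK_TABLES})
    for role in sorted(per_slot):  # sorted only to make the result deterministic
        out = per_slot[role]
        for table, targets in out.links.items():
            combined.links.setdefault(table, set()).update(targets)
        combined.page_props.update(out.page_props)  # assumption: later role wins
    return combined


def run_updates(per_slot, other_updates):
    """LinksUpdate sees the combined output; other updates see only their own slot.

    other_updates maps a role to a list of callables taking that slot's output.
    """
    results = {"LinksUpdate": merge_tracking(per_slot)}
    for role, updates in other_updates.items():
        results[role] = [update(per_slot[role]) for update in updates]
    return results
```

The generic merge is what makes this approach cheap to implement: no per-content-model logic is needed to aggregate the tracking data.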

Rationale: This approach preserves the maximum of information, and is easy to implement. The fact that it may lead to extraneous data tracking and spurious purging of the parser cache does not seem relevant in the light of the currently targeted use cases. This issue should be revisited in the context of the creation of an entirely new mechanism for tracking dependencies of generated artifacts for purging.

Relevant code experiment:

Related Objects

Status	Assigned	Task
		· · ·
Resolved	• Ramsey-WMF	T199352 Deploy Structured Data on Commons with arbitrary Statements
Resolved	daniel	T194037 Track dependencies for multiple Content objects per page
Resolved	daniel	T190063 Tracking dependencies for multiple Content objects per page (MCR)
		· · ·

Event Timeline

daniel created this task. Mar 19 2018, 3:50 PM

daniel updated the task description. Mar 19 2018, 3:59 PM

daniel added a parent task: T174038: Initial implementation of MCR page update interface.

In T190063, @daniel wrote:

Should the default behavior be to track only the resource usage of the main slot, requiring handler code for other slots to explicitly add tracking for their content?

For clarity, there are two ways a non-main slot's content could wind up in the default view:

1. Through parser functions or tags in the main slot's wikitext.
2. Automatic addition, similar to Cite's automatic addition of the references list if <references/> was not present on the page.
   A. By code for each such feature.
   B. Generic concatenation.

Obviously for #1 and #2A there will be specialized "handler code" adding the content. IMO that same code can also add any tracking metadata at the same time, much as the code does now for #1-like cases where the content is coming from some other page rather than a non-main slot. Also IMO, there's no need for #2B; if some feature wants that behavior, it's easy enough to implement, and if enough features want it it's easy enough to add a utility method to do it.

If the content of an auxiliary slot is not visible per default (in the standard /wiki/Foo view), should resource usage for it be tracked? It seems that, if we only track for purging, the answer should be "no". If we track to be able to find all usages, the answer should be "yes".

I note that we already don't (and sometimes can't) do this for wikitext templates as they exist now. If the usage is inside <includeonly>, or is in an {{#if:}} branch that's not taken on the default view, etc., we don't track it. If the usage is to be generated based on the passed-in parameters, we can't track such uses. Or, if you think of non-main slots as being transcluded like templates, we don't track usages that are inside <noinclude> on the transcluding page, or that are in an {{#if:}} branch that's not taken based on the template parameters, etc.

Our wiki editors already handle such cases without too much difficulty. I don't see MCR making that so much more difficult that we'd need to go out of our way to track usages that aren't actually used on the main page's view. And some users may even complain if we do, much as they complain now about "phantom" links or transclusions from parser functions such as {{#ifexist:}} or Scribunto's various mw.title accesses.

Our wiki editors already handle such cases without too much difficulty. I don't see MCR making that so much more difficult that we'd need to go out of our way to track usages that aren't actually used on the main page's view.

I disagree. With <noinclude>, the usage is still found in the page relevant for maintenance: on the template page. With <includeonly>, it's more difficult to find out why something is reported as being used on a page, but it's still possible by looking at the templates. Links/templates/categories generated from parameters are rare enough and, by nature, systematic enough, that they are less of a problem.

With MCR however, we will move things that are currently on separate pages (e.g. template documentation) or embedded in the page (e.g. file descriptions) out of the main content. Things referenced in such content will then no longer be tracked automatically. E.g. while it's currently possible to find all references to a given external domain on file description pages, this would vanish if we didn't add tracking for the new mediainfo slot.

Sure, we can do this for each slot individually, but I don't see why. We'll need the same boilerplate for tracking links for all new slots we introduce. I don't see a use case where we would not want that - or at least none where it would be a problem to have it.

The point is: it's the easy thing to do, and it's the safe thing to do. So why wouldn't we?

Also IMO, there's no need for #2B;

I agree that there is no need for blind concatenation of HTML. We will generally want some code to control whether the HTML goes to the top or the bottom, whether it is presented as a section or a floating box, etc. Though I think adding a section at the bottom would be ok as a default.

However, "require the slot extension to explicitly say where the HTML goes" does not imply "require the slot extension to explicitly make tracking happen". I think tracking should be the default, if not forced, to avoid situations where it is simply forgotten, or done incorrectly. I simply see no case in which we would not want it to happen.

I think this is more complicated than being assumed here.

Let's say I edit the documentation slot of {{Infobox}}. That should not trigger a purge flood, unlike editing the template itself. How would that work, internally? When I use an infobox in an article, what is that a dependency on, exactly? The main slot? The "default view", whatever that is? The whole revision? Do we store that in the link table or in handler logic?

What if I edit the structured data blob for Person placeholder.png which is used in a million articles? Should there be some kind of purge? (No reason to, right now; could that change once we have fully machine-readable license and authorship metadata? Or centralized image captions / alt text, which is one of the things planned for SDoC?)

Let's say we have something that's too large for the default view but costly enough that we want it cached. (A blame map, maybe, although the costly part there would probably happen before storing it in a slot.) How would invalidation work, especially if we link tracking to visibility in the default view?

@Tgr You are right: tracking dependencies between resources needs to become more fine grained. That's the idea behind https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/DependencyEngine (which needs an update). The services team has an investigation of this in their annual plan. I don't think there's an epic for this on phabricator yet - perhaps I will add one.

For now, the idea is that MCR should not make things worse, but cannot be expected to solve all problems we have with purging.
So:

Let's say I edit the documentation slot of {{Infobox}}. That should not trigger a purge flood, unlike editing the template itself.

That's a good point. An edit to a non-main slot should not trigger purges based on templatelinks, because template transclusion only transcludes the content of the main slot. This requires onArticleEdit() to know which slots were changed. That's easily doable with the current design for PageUpdater, but I didn't think of that need so far. Thanks!
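A rough sketch of that check, as a Python toy model rather than PageUpdater code: slots are modelled as a mapping from role name to content address, and the set of roles whose changes matter for transclusion is a hypothetical parameter that defaults to the main slot only.

```python
def changed_roles(old_slots, new_slots):
    """Roles whose content address differs between two revisions.

    A slot that was added or removed counts as changed.
    """
    roles = set(old_slots) | set(new_slots)
    return {r for r in roles if old_slots.get(r) != new_slots.get(r)}


def should_purge_transclusions(old_slots, new_slots,
                               transcluded_roles=frozenset({"main"})):
    """True if the edit touched a slot that transcluding pages depend on."""
    return bool(changed_roles(old_slots, new_slots) & transcluded_roles)
```

Keeping the relevant roles as a parameter leaves room for slots other than main to be declared as affecting transclusion.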

What if I edit the structured data blob for Person placeholder.png which is used in a million articles? Should there be some kind of purge?

No, edits to image description pages don't trigger purges based on imagelinks. MCR is not going to change that. If in the future image usage will imply usage of some bits of structured data, we will need a more fine grained tracking mechanism to take care of that.

Let's say we have something that's too large for the default view but costly enough that we want it cached. (A blame map, maybe, although the costly part there would probably happen before storing it in a slot.) How would invalidation work, especially if we link tracking to visibility in the default view?

The blame map would only depend on the main slot, and it would get re-rendered when the main slot changes, since we currently have no optimization in place for avoiding that (that's another rabbit hole to dive into). But let's say it's something else, like inline annotation/discussion, which may use templates, but would be invisible in the standard view.

When a template used in an annotation changes, the page would not be re-rendered, unless we a) have a better tracking mechanism (which would ideally re-render only the annotation, not the standard view) or b) track invisible things as if they were visible, at least optionally. (a) is for later, so we will do (b). The question is just whether we track all things, visible or not, per default, or if we require the respective extension to trigger the tracking explicitly.

In my mind, if the ParserOutput of a non-main slot declares e.g. templatelinks, they should go into the database. And if the Content object of a non-main slot returns additional DataUpdate objects of some kind, they should be executed, regardless of what slot the content lives in.

brion subscribed. Mar 21 2018, 8:59 PM

We are going to host an RFC IRC Meeting 2018-03-28 in the #wikimedia-office channel at 1pm PST (21:00 UTC, 22:00 CET). NOTE: this is 1 hour earlier than typical.

In T190063#4062624, @daniel wrote:

With MCR however, we will move things that are currently on separate pages (e.g. template documentation) or embedded in the page (e.g. file descriptions) out of the main content. Things referenced in such content will then no longer be tracked automatically. E.g. while it's currently possible to find all references to a given external domain on file description pages, this would vanish if we didn't add tracking for the new mediainfo slot.

Since the media info is supposed to be displayed in the default view, the only reason that would happen would be if you screwed up writing the SDC code. In other words, this is a bogus example.

Template documentation would be transcluded on the template page's default view somehow or other, much as it is already, so again the only way lack of tracking would happen would be if whoever wrote the "display it on the template page's default view" code screwed it up.

In general, the only way for something to not be tracked would be to have it not actually be used in any context that's tracked. Which can already be done in wikitext too (e.g. in an {{#if}} branch that's only taken for users with certain non-default preferences).

Sure, we can do this for each slot individually, but I don't see why. We'll need the same boilerplate for tracking links for all new slots we introduce. I don't see a use case where we would not want that - or at least none where it would be a problem to have it.

"I can't think of a case where we wouldn't do it, so let's force it everywhere" isn't very convincing, IMO. Nor is "I don't trust people to write functioning code."

And chances are the "boilerplate" would just be to call a $parserOutput->addParserOutputMetadata( $myOutput ) method in cases where that makes sense. We already have addOutputPageMetadata() for a similar use case (special page transclusion).

But we've already argued this to death elsewhere, the point of this task is (I hope) to get other people to weigh in.

daniel updated the task description. Mar 28 2018, 7:11 PM

In T190063#4084994, @Anomie wrote:

the only way lack of tracking would happen would be if whoever wrote the "display it on the template page's default view" code screwed it up.

That's my point: every extension that defines a slot then has the opportunity to screw this up. Few will screw it up completely, but inconsistencies will happen, especially when things change over time.

So, I'm saying: it's better to do the safe thing per default, than requiring everyone to do the right thing all the time.

"I can't think of a case where we wouldn't do it, so let's force it everywhere" isn't very convincing, IMO.

Making it the default isn't the same as forcing it. And "most use cases need it this way" is a good reason to make it the default.

Nor is "I don't trust people to write functioning code."

More like "I don't trust all extension authors to fully understand this" and "I don't rely on all extensions to be kept up to date with core all the time".

And chances are the "boilerplate" would just be to call a $parserOutput->addParserOutputMetadata( $myOutput ) method in cases where that makes sense. We already have addOutputPageMetadata() for a similar use case (special page transclusion).

Providing a utility function for this would make this safer, yes. But this isn't just one line of boilerplate. Where does this line go? Into some combine-this-slot-with-the-default-view-output method, I suppose. With sensible defaults, that method doesn't even need to exist. If we do this your way, I expect to see the same boilerplate implementation of the same method in all extensions that use this. That seems pointless.

But really, this all depends very much on how we answer question (2). If we only want to track "visible" things, then it makes sense to have the tracking happen explicitly in the same place where we combine the HTML. If we want to track usage for maintenance, it makes more sense to just make this happen always.

If in the future we have separate tracking mechanisms for these purposes, I suppose the entire discussion becomes moot. Then, explicit fine grained tracking to allow purging of the combined HTML is needed. Then the question is only if there is any reason not to always track references to resources.

daniel updated the task description. Mar 28 2018, 7:27 PM

daniel updated the task description. Mar 28 2018, 7:58 PM

Sorry for the last-minute followup :(

In T190063#4065605, @daniel wrote:

Let's say I edit the documentation slot of {{Infobox}}. That should not trigger a purge flood, unlike editing the template itself.

That's a good point. An edit to a non-main slot should not trigger purges based on templatelinks, because template transclusion only transcludes the content of the main slot.

That's not true in general, either; an edit to the TemplateStyles slot should trigger a purge. I think the slot handler needs to be able to tell which is the case (which also means that some sort of role-based handler will be necessary, for at least some of the slots).

What if I edit the structured data blob for Person placeholder.png which is used in a million articles? Should there be some kind of purge?

No, edits to image description pages don't trigger purges based on imagelinks. MCR is not going to change that. If in the future image usage will imply usage of some bits of structured data, we will need a more fine grained tracking mechanism to take care of that.

Again, if we will have default captions in the structured media data as planned, it does not seem outlandish to expect an image wikicode with no caption to pull that in automatically. Even more so with alt text (where it is an accessibility problem currently that most editors don't add it manually). So once again I think we'd have to rely on the role handler to control that.

Let's say we have something that's too large for the default view but costly enough that we want it cached. (A blame map, maybe, although the costly part there would probably happen before storing it in a slot.) How would invalidation work, especially if we link tracking to visibility in the default view?

The blame map would only depend on the main slot, and it would get re-rendered when the main slot changes, since we currently have no optimization in place for avoiding that (that's another rabbit hole to dive into). But let's say it's something else, like inline annotation/discussion, which may use templates, but would be invisible in the standard view.

When a template used in an annotation changes, the page would not be re-rendered, unless we a) have a better tracking mechanism (which would ideally re-render only the annotation, not the standard view) or b) track invisible things as if they were visible, at least optionally. (a) is for later, so we will do (b). The question is just whether we track all things, visible or not, per default, or if we require the respective extension to trigger the tracking explicitly.

An example of invisible content where you wouldn't want to track changes is TemplateData (assuming there can be dependencies in a TemplateData slot; IIRC there is at least a feature request for description fields that are rendered from wikitext), for the same reason as above.

In my mind, if the ParserOutput of a non-main slot declares e.g. templatelinks, they should go into the database. And if the Content object of a non-main slot returns additional DataUpdate objects of some kind, they should be executed, regardless of what slot the content lives in.

That will get us into trouble with pages where updating the main content is expensive and updating the non-main slot is cheap (such as popular templates with a documentation slot).

Or I guess MCR-ifying features where currently there is an expensive part and a cheap part on separate pages could be simply blocked on the future fine-grained tracking.

Scott_WorldUnivAndSch subscribed. Mar 28 2018, 8:04 PM

@Tgr You are right, we need fine grained per-slot tracking to enable efficient purging. We should keep this in mind. This RFC however is about what to do as long as we don't have that. What behavior do we aim for with the current db schema for tracking meta-data?

Note to self: look at @Tgr's comments on https://www.mediawiki.org/wiki/Topic:U8zvaqr5vxw5d1pw

Logs from yesterday's meeting: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-28-20.02.log.html

Meeting notes: https://tools.wmflabs.org/meetbot/wikimedia-office/2018/wikimedia-office.2018-03-28-20.02.html

Note: @daniel will be updating the description using the full log, since we don't have a real summary in the meeting notes.

Here's my attempt at a summary of yesterday's discussion, which featured @tstarling, @Tgr, @Anomie and myself:

No relevant use cases were found for slots that are not visible in the default view.
The need for a "next generation" fine grained dependency tracking system was once more highlighted. See DependencyEngine for an (outdated) brain dump.
There was agreement that some kind of "slot handler" logic will be needed at some point, and that such per-slot logic could control how the default view's ParserOutput is constructed from multiple slots.
There was some discussion of whether visibility of auxiliary slots should be controlled by the skin, and how this plays into the idea of modular content and device specific page composition.
Discussion of tracking references to templates, images, external links, etc being relevant for maintenance work by editors, not just for purging.
Mention of per-slot ParserOutput objects being needed for B/C with hook signatures
Mention of the fact that all link tracking info needs to be present in the "main" ParserOutput for AbuseFilter to pick it up, until AbuseFilter becomes aware of multiple slots.
For the baseline implementation of MCR, @daniel maintains the original proposal:
- Create a ParserOutput for each slot plus a ParserOutput for the combined default view.
- Merge all link tracking info into the combined ParserOutput, and construct a LinksUpdate based on that.
- Execute all other DataUpdates for all slots.
- The mechanism for combining the HTML for all slots is left for later, as is other per-slot logic.
No pertinent concerns regarding the feasibility of the proposed approach were raised.
- @Anomie maintains that the creation of ParserOutput objects for each slot may be wasteful, and tracking all links for all slots may cause spurious purging of the ParserCache.
- However, no concrete use case was indicated for which this would cause serious problems.
Anomie wants Content objects to know how to add "their" meta-data to a ParserOutput. Daniel says this needs a breaking change to the Content interface, and means the logic will have to be implemented for each type of content, while merging ParserOutputs can be implemented in a generic way.
It was pointed out that constructing a ParserOutput is generally cheap, especially when omitting HTML generation; It's parsing wikitext that is expensive. But we have no use case for which we'd want to have wikitext in an auxiliary slot, and not track things that are referenced from the wikitext. So we'll have to parse it anyway.
There was rough consensus about the principle that the choice between feasible options should be left to the implementor.
There was rough consensus that tracking everything for now is the safe option, with no downside for the current use case (namely, SDC).

The discussion was reviewed during the TechCom meeting after the RFC discussion. The updated proposal is to go on last call until April 11, pending review by @Krinkle.

daniel updated the task description. Mar 29 2018, 2:38 PM

daniel updated the task description. Mar 30 2018, 4:21 PM

Anomie wants Content objects to know how to add "their" meta-data to a ParserOutput.

This is not incorrect. That would be done by the SlotHandler, not the Content object.

In T190063#4097873, @Anomie wrote:

Anomie wants Content objects to know how to add "their" meta-data to a ParserOutput.

This is not incorrect. That would be done by the SlotHandler, not the Content object.

I assume you mean that this is not correct.

If I recall correctly, both options were discussed, on IRC and beforehand. It was not clear to me that you had abandoned the idea of having this logic in the Content object. If this is so, that would mean the SlotHandler will have to know about the content model. This is probably not a problem for most slots, but it is for the main slot - there would have to be different SlotHandlers for the main slot, depending on what content model is used there.

In T190063#4116404, @daniel wrote:

I assume you mean that this is not correct.

Yes, I typoed.

If I recall correctly, both options were discussed, on IRC and beforehand. It was not clear to me that you had abandoned the idea of having this logic in the Content object.

Yes, discussion with you convinced me that putting it in Content was probably too much mixing of concerns.

If this is so, that would mean the SlotHandler will have to know about the content model. This is probably not a problem for most slots, but it is for the main slot - there would have to be different SlotHandlers for the main slot, depending on what content model is used there.

The "main" slot has always been somewhat special, for example it has to be present (even if empty) on every page. In my thinking it probably wouldn't have a SlotHandler at all. MediaWiki would generate the ParserOutput for the main slot as it does now for non-MCR pages, and then allow the SlotHandlers for the other slots to add to that PO.

The "main" slot has always been somewhat special, for example it has to be present (even if empty) on every page.

I would like this to be the only way in which the main slot is special. We may treat it as a special default for B/C for now, but that should go away.

In my thinking it probably wouldn't have a SlotHandler at all. MediaWiki would generate the ParserOutput for the main slot as it does now for non-MCR pages, and then allow the SlotHandlers for the other slots to add to that PO.

It's an option, but one that I'd rather avoid. Special case code is annoying.

Krinkle updated the task description. Apr 11 2018, 6:09 PM

Krinkle updated the task description. Apr 11 2018, 7:42 PM

Krinkle updated the task description. Apr 11 2018, 9:00 PM

Krinkle updated the task description.

daniel updated the task description. Apr 11 2018, 9:16 PM

Krinkle triaged this task as Medium priority. Apr 18 2018, 8:04 PM

Krinkle moved this task from Request IRC meeting to P5: Last Call on the TechCom-RFC board.

In T190063#4091409, @daniel wrote:

The discussion was reviewed during the TechCom meeting after the RFC discussion. The updated proposal is to go on last call until April 11, pending review by @Krinkle.

Sorry for the delay. I've started the last call now.

Per the process, if no new concerns are raised on this task and/or on Wikitech-l, and the proposal is not changed before then, this proposal will be approved on Wed, 2 May 2018.