
Use a dedicated mechanism to track page dependencies
Open, Needs Triage, Public

Description

Summary
The content of some pages may depend on the content of other pages; for now, let's call the former Source Pages and the latter Target Pages. The MediaWiki parser needs to keep track of these dependencies, so that when a Target Page changes, the cached output for each Source Page can be invalidated and re-cached in due time. MediaWiki currently does this by reusing the tables that keep track of links (pagelinks, templatelinks, etc.), but this has caused issues. This task focuses on replacing that approach with a new one, in which a dedicated mechanism (including a new table) is used to keep track of page dependencies. MediaWiki should also expose this new mechanism to extensions, so they can register additional dependencies beyond what MediaWiki core does.

Details
Let's start with two really straightforward examples of page dependencies:

  • If Source Page has a link to Target Page (i.e. [[Target_Page]] in its source), whether the link is shown as a blue or red link depends on whether Target Page exists. If Target Page is deleted, links to it should change to red links; if it is recreated, they should change to blue links. MediaWiki uses the pagelinks table to keep track of links between pages, and each time a page is created or deleted, it finds all pages that link to it and invalidates their cache.
  • If Source Page transcludes Target Page (i.e. {{Target_Page}} or {{Target_Page|...}} in its source), the content of Source Page will be based on the parsed output of Target Page, given the parameters provided to it as a template. Here, not only the deletion and creation of Target Page matter, but also its actual content; if its content changes, all pages that transclude it need to be parsed and cached again. MediaWiki uses the templatelinks table to identify all such pages.
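The two cases above can be modelled with a small sketch. This is a toy in-memory model in Python, not MediaWiki code: the class name, method names, and event names ("create", "delete", "edit") are assumptions made for illustration. It only shows the rule described above, namely that links care about existence while transclusions care about every change.

```python
class LinkTables:
    """Toy stand-in for the pagelinks and templatelinks tables (hypothetical)."""

    def __init__(self):
        # Each dict maps a target title to the set of source titles referencing it.
        self.pagelinks = {}
        self.templatelinks = {}

    def add_link(self, source, target):
        # Models [[Target_Page]] appearing in the source page.
        self.pagelinks.setdefault(target, set()).add(source)

    def add_transclusion(self, source, target):
        # Models {{Target_Page}} appearing in the source page.
        self.templatelinks.setdefault(target, set()).add(source)

    def pages_to_invalidate(self, target, event):
        """Return the source pages whose cached output is stale after `event`.

        `event` is one of "create", "delete" or "edit" (assumed names).
        """
        # Transcluding pages depend on existence and content, so any event counts.
        stale = set(self.templatelinks.get(target, ()))
        # Linking pages only depend on existence (red link vs. blue link).
        if event in ("create", "delete"):
            stale |= self.pagelinks.get(target, set())
        return stale


tables = LinkTables()
tables.add_link("Page_A", "Target_Page")
tables.add_transclusion("Page_B", "Target_Page")
stale_after_edit = tables.pages_to_invalidate("Target_Page", "edit")
stale_after_delete = tables.pages_to_invalidate("Target_Page", "delete")
```

In this model an edit to Target_Page only affects the transcluding page, while a deletion affects both the linking and the transcluding page.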

Now let's talk about three use cases that don't fit this design:

  • If the Source Page contains the {{PAGESINCATEGORY:...}} magic word (which is part of MediaWiki core), and the number of pages in the target category changes, MediaWiki has no mechanism to notice this and trigger a cache invalidation on the Source Page. (See T221795 for related issues surrounding category counts)
  • If the Source Page uses the {{#categorytree:...}} function from MediaWiki-extensions-CategoryTree then its output will change as pages are added to or removed from some target category. This, again, is not something that is currently tracked by MediaWiki.
  • If the Source Page uses the {{#ifexist:...}} function from ParserFunctions to check the existence of a Target Page, the existence of the Target Page obviously has a direct impact on the output of the Source Page. To keep track of this, ParserFunctions overuses the pagelinks table by inserting a link from Source Page to Target Page, but this causes some undesirable side effects. (See T14019)

Without going too much into what the solution should be, it may be helpful to set expectations. One could imagine a pagedependencies table that specifically tracks that the parsed output of some Source Page depends on the existence and/or output of some Target Page. This way, if ParserFunctions wants to introduce a dependency from Source Page to Target Page, it can add a row in pagedependencies without tainting pagelinks. Similarly, when a page is parsed and its new output changes which categories the page belongs to, the parser can check pagedependencies to find other pages whose content relies on the contents of those categories.
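To make the idea of such a table more concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the column names (pd_from, pd_kind, pd_target), the dependency kinds ("existence", "category_members"), and the helper functions are hypothetical and do not reflect the schema in the linked patches or any agreed design.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: one row per (source page, kind of dependency, target).
    conn.execute(
        """
        CREATE TABLE pagedependencies (
            pd_from   TEXT NOT NULL,  -- Source Page whose output depends on the target
            pd_kind   TEXT NOT NULL,  -- e.g. 'existence' or 'category_members'
            pd_target TEXT NOT NULL,  -- Target Page (or category) depended upon
            PRIMARY KEY (pd_from, pd_kind, pd_target)
        )
        """
    )
    # Lookups go from target to sources, so index that direction.
    conn.execute("CREATE INDEX pd_lookup ON pagedependencies (pd_kind, pd_target)")


def save_dependencies(conn, source, deps):
    """Replace all rows for `source` with the (kind, target) pairs from its latest parse."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM pagedependencies WHERE pd_from = ?", (source,))
        conn.executemany(
            "INSERT OR IGNORE INTO pagedependencies VALUES (?, ?, ?)",
            [(source, kind, target) for kind, target in deps],
        )


def dependents(conn, kind, target):
    """Return the source pages to invalidate when `target` changes in the given way."""
    rows = conn.execute(
        "SELECT pd_from FROM pagedependencies "
        "WHERE pd_kind = ? AND pd_target = ? ORDER BY pd_from",
        (kind, target),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_schema(conn)
# An {{#ifexist:...}} check would record an existence dependency instead of a pagelinks row.
save_dependencies(conn, "Source_Page", [("existence", "Target_Page")])
# A {{PAGESINCATEGORY:...}} use would record a dependency on category membership.
save_dependencies(conn, "Stats_Page", [("category_members", "Category:Example")])
to_invalidate = dependents(conn, "existence", "Target_Page")
```

Recording the kind of dependency alongside the target is one possible way to let a single table serve more than one use case; whether that is desirable is exactly what the discussion below is about.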

Ultimately, this task could open the way for T56902: Deprecate and remove the purge action from MediaWiki

Considerations
It may be best to still use the *links tables as we do now, but additionally use a dedicated mechanism to keep track of dependencies that cannot be properly captured by those tables. The advantage is that our pagedependencies table will not duplicate the data that already exists in the *links tables; the disadvantage is that it will fragment the dependency-tracking process even further.
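The trade-off of this hybrid approach can be illustrated with a short sketch. It is a toy Python model under stated assumptions: the function name, the event names, and the dict-based stand-ins for the tables are all hypothetical. The point it shows is that invalidation would have to consult both the existing *links data and the dedicated table.

```python
def stale_sources(target, event, pagelinks, templatelinks, pagedependencies):
    """Collect pages to re-parse after `event` ("create", "delete" or "edit") on `target`.

    pagelinks, templatelinks: dicts mapping a target title to a set of source titles.
    pagedependencies: dict mapping (kind, target) to a set of source titles;
    it holds only dependencies that the *links tables cannot express.
    """
    # Transclusions are affected by every kind of change to the target.
    stale = set(templatelinks.get(target, ()))
    if event in ("create", "delete"):
        # Existence changed: ordinary links flip between red and blue ...
        stale |= pagelinks.get(target, set())
        # ... and so do existence checks tracked in the dedicated table.
        stale |= pagedependencies.get(("existence", target), set())
    return stale


pagelinks = {"Target_Page": {"Page_A"}}
templatelinks = {"Target_Page": {"Page_B"}}
pagedependencies = {("existence", "Target_Page"): {"Page_C"}}
affected = stale_sources("Target_Page", "delete", pagelinks, templatelinks, pagedependencies)
```

No data is duplicated between the three structures, but every caller that wants the full set of dependents has to know about all of them, which is the fragmentation mentioned above.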

Event Timeline

JJMC89 removed Rickygaleana007 as the assignee of this task.
JJMC89 added a subscriber: Rickygaleana007.
JJMC89 removed a subscriber: Rickygaleana007.

Change 970863 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] schema: Add pagerelationlinks table

https://gerrit.wikimedia.org/r/970863

Change 970864 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] LinksUpdate: Add support for pagerelationlinks table

https://gerrit.wikimedia.org/r/970864

Change 970865 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] [PoC] parser: Use ParserOutput::addPageRelation on some parser functions

https://gerrit.wikimedia.org/r/970865

I'm not a native speaker, so correct me if I'm wrong, but "page dependency" is clearer than "page relation". "Relation" doesn't imply what kind of relation it is or what impact a change is going to have. I'd go as far as to say that since we have "pagelinks", we should call it "pagedependencies" or "pagedeps" (we do have "module_deps", so we could also go with "page_deps" or "page_dependencies").

Beside that, I think this is a good idea and should be done.

Naming is always hard. I am also not a native speaker, but from a technical point of view "dependency" feels stronger than a relation between the pages. Having the name end with *links would make its usage in LinksUpdate clearer, as all the other tables are also named like that. The name "backlinks" is not possible, as that is the name of the whole concept (the class Backlinks is used to queue the jobs, etc.).
I am happy to change my patch set when there is an idea for a good name or consensus on another name.

Native speaker chiming in to say I agree: dependency means literally that A depends on B, relation is vague and can mean all kinds of things. Not going to weigh in on other aspects of the name though, carry on :-)

Possible names for the new table could be:

  • pagerelationlinks
  • page_cachecoherency
  • pagetracking
  • pagedependencies
  • page_dependencies
  • ...

@tstarling, @Huji, @Ciencia_Al_Poder
You started a discussion about this feature in the context of #ifexist in T14019#7662302. Can you help find a good name for a database table that stores the new information in a generic way for more parser functions, not only #ifexist? Thanks.

Without a good name for the table, a table prefix, and a name for the function in ParserOutput, the task cannot be implemented. I am not good at naming things, at least not in (technical) English.

I'll go bother some native speakers.

page_dependencies would be my preference, as a native speaker. For the most part, all multi-word table names in the MediaWiki schema contain underscores between the words; the exceptions all seem to be very old tables. Dependencies is a better word than relations, as it implies the direction of the relationship.

On a somewhat-related note, given that MCR is a thing, do we want to be tracking these dependencies at the page level or the slot level?

I vote for page_dependencies. It sounds more natural, and the name alone describes what the table is for. The only issue is that "dependencies" in the name could mean the page depends not only on other pages but also on other things.

So another vote could go for page_relations or page_relation_dependencies.

I'm questioning the reason to introduce this table. This task seems mainly driven by feature requests that are not substantiated by use cases or user demand, not approved for implementation, and imho not yet very well thought through (e.g. supporting propagation of category tree and category member counts), and for which alternatives should imho be explored first. In chatting with @Ladsgroup briefly, he indicated that this table should only be used for #ifexist and avoid scope creep at this time.

The issue of #ifexist is more tangible and substantiated by repeated Community Wishlist demand. I've left a suggestion for a simple alternative solution to that problem at T14019#9466679.

Having reviewed the linked patches, my vote is for it to be called ifexistlinks and for the implementation to be moved to ParserFunctions. The schema does not seem to be generic enough to justify having a generic name.

If it addressed two use cases, rather than just one, then I would reconsider. But the schema apparently does not allow it to address more than one use case.

Have a look at T253026: Introduce a centralized Dependency Tracking Service for a truly generic approach to this problem. It's a broad and complex topic. That task gives you an idea of what it would take to be able to claim that you are introducing a system to track page dependencies, rather than just addressing a couple of edge cases.

In T14019#9467421 I proposed existencelinks. I don't like page_dependencies because pages can depend on each other in ways other than an existence check.

On a somewhat-related note, given that MCR is a thing, do we want to be tracking these dependencies at the page level or the slot level?

Tracking of parser metadata is per page and only relevant when there is a slot with wikitext (or when parsing is enabled via config for the content model). I am not sure whether the wikitext must be in the main slot, but per-page tracking feels easier (maybe there could be extra slots with wikitext providing parser metadata to track in these tables).

[...] In chatting with @Ladsgroup briefly, he indicated that this table should only be used for #ifexist and avoid scope creep at this time.

I am not focussed on #ifexist, but that would be on the list as well. Other parser functions, like the PAGESIZE parser function, are already misusing link tables for tracking. From my point of view, the use of PAGESIZE should not be tracked as a transclusion (T20188) and should not show up under the edit form as a used template; it needs a tracking database table as well. Maybe the API term for this is better: it calls this embeddedin. Under that term, showing the use of PAGESIZE could be acceptable, but that would need a lot of rewording in the UI.
Using templatelinks also clashes with how MediaWiki-extensions-FlaggedRevs uses the templatelinks table to require additional review for "template" changes.

In order to avoid purges by bots (T56902), more tracking of relations between pages is needed; a solution in core could make it easier for extensions with parser functions to provide such relations as well.
I have written up some other use cases for core parser functions in the commit message of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/970865 but have not structured them further.