
Develop a less resource intensive DPL-like extension for Wikimedia to replace DPL (WMF), DPL3, and DPL4
Open, Needs Triage · Public · Feature

Description

Feature summary (what you would like to be able to do and where):

DPL was originally released in 2005 for Wikinews and has been forked several times. DPL3 is widely used on Fandom and Miraheze — for example, on Fandom wikis such as the Genshin Impact Wiki to fetch and format lists from various categories — but it has the downside of being very resource intensive. DPL3 was recently superseded by DPL4, but neither is compatible with MediaWiki 1.45+ (yet), nor is either suited for Wikimedia.

This task is about developing a successor to DPL that solves the performance problems: long queries and slow load times that persist even when results are cached.

Much of the functionality would remain the same as in DPL3/DPL4; the differences would be under the hood. For example, when a query with specific parameters is run for the first time on page save, it should run asynchronously, with an error message shown to the reader that the query is not ready yet (and a link to purge the page). Only a few DPL queries would run at a time, to prevent performance issues, so some queries might take hours or days to update depending on properties such as the size of the category, whatlinkshere, etc. on which the query is being run.

There should also be a Lua function mw.ext.DPL(...) that throws a Lua error, similar to the one described above, when the query has not finished; as well as a JS function mw.ext.DPL(...) that runs asynchronously (returning a Promise) and fetches a DPL query through an API (perhaps a REST endpoint such as /api/DPL?[properties], or the action API /api.php?action=dpl&dplqueryparams=[properties]). If the API does not return a result within a minute, the Promise is rejected with a similar error. This API could be rate limited to a specific number of queries per user per day, with higher limits for logged-in users and even higher ones for bot and administrator accounts.
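
For illustration, a Scribunto module might call the proposed Lua entry point like this (mw.ext.DPL does not exist yet; pcall is simply the standard Lua way to trap the "query not ready" error described above):

local p = {}

function p.list(frame)
    -- mw.ext.DPL is the hypothetical API proposed in this task.
    -- pcall traps the "query not ready yet" Lua error so the module can
    -- show a friendly message instead of failing the whole page.
    local ok, result = pcall(mw.ext.DPL, { categories = 'K-pop groups' })
    if not ok then
        return 'This list is still being generated; purge the page to retry.'
    end
    return table.concat(result, '\n')  -- assuming a plain list of page names
end

return p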

Each DPL invocation from anywhere other than the APIs would increment the expensive parser function count. There would also be a DPL sandbox that allows people to experiment with and generate DPL queries.

The extension should be backwards compatible with DPL (WMF) and DPL4.

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

Many of the use cases would be list articles on Wikipedia. For example, a list article such as "List of K-pop groups" could use a DPL query that fetches all of the groups and presents them in a list, as sketched below.
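
For illustration, such a list article might contain a query along these lines (the parameter names mirror the DPL3-style syntax shown later in this task; the exact parameters are only an example):

{{#DPL:
| categories = K-pop groups
| ordermethod = sortkey
| mode = unordered
}}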

Another use case would be helping administrators review speedy deletions, since a DPL list could show each nominated page alongside the reason it was nominated for speedy deletion.

Benefits (why should this be implemented?):

This would save editor time by allowing pages to be updated dynamically in near-real time. The original implementation of DPL has had problems that prevented it from being expanded to other wikis. If this extension were reworked from the ground up to address the performance concerns, it could potentially be used on more Wikimedia projects with few performance worries.

Event Timeline

Restricted Application added subscribers: Reception123, Aklapper.

An alternative idea that often gets thrown around is to have a DPL-like extension that is backed by CirrusSearch. Full-text search tends to be pretty good at intersection queries, which is often the problematic part of DPL-like extensions (e.g. CirrusSearch's incategory: keyword already handles searches like incategory:"Foo" incategory:"Bar").

Why not just allow only listing all of the pages from a category (+ subcategories)? This shouldn't really be resource intensive (I think?), and would cover like 90% of use cases, no? If an intersection is required, a new subcategory could just be created, same as is constantly done now.

I like the full-featured functionality of DPL3; for example, the following query would fetch data from an infobox template and put it in a neatly formatted table:

{{#DPL:
| categories = Ive songs
| include = {Infobox song}:title:album:duration
| format = {¦ class="wikitable"\n¦+ Notable Ive songs\n! Title !! Album !! Duration\n¦-, ¦-\n¦, \n, ¦}
| secseparators = [[%PAGE%|,]], ¦¦ , , ¦¦ ,
}}

Or something similar. But performance definitely is an issue, and we definitely should address these significant performance issues in DPL4. Of course, on Wikipedia this assumes that every entry generated for this table is notable, which more than likely is not the case. But there are still articles that could use DPL on Wikipedia.

Why not just allow only listing all of the pages from a category (+ subcategories)? This shouldn't really be resource intensive (I think?), and would cover like 90% of use cases, no? If an intersection is required, a new subcategory could just be created, same as is constantly done now.

That should be fine (subcategories are a bit tricky, but possibly OK with deepcat), as long as you are limited to specific sorting options.

This wouldn't solve Wikinews' use case at all, but it might be sufficient for other users of DPL.

I like the full-featured functionality of DPL3; for example, the following query would fetch data from an infobox template and put it in a neatly formatted table: […]

I would suggest any future iterations only worry about generating the list of matching articles, and then let Lua do stuff like extracting contents. That would probably be more flexible and reduce implementation complexity.
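
As a sketch of that division of labour: assuming a hypothetical mw.ext.DPL that returns only a list of page titles, a Scribunto module could do the field extraction itself with real calls like mw.title.new and getContent (the scraping below is deliberately naive):

local p = {}

local function firstParam(wikitext, name)
    -- Naive scrape of "| name = value" from the page source; a real module
    -- would want something more robust (TemplateData, Wikibase, etc.).
    return wikitext:match('|%s*' .. name .. '%s*=%s*([^|}\n]+)')
end

function p.songTable(frame)
    local pages = mw.ext.DPL({ categories = 'Ive songs' })  -- hypothetical
    local rows = {}
    for _, pageName in ipairs(pages) do
        local content = mw.title.new(pageName):getContent()  -- real, but costly per page
        if content then
            rows[#rows + 1] = string.format('|-\n| [[%s]] || %s || %s',
                pageName,
                firstParam(content, 'album') or '',
                firstParam(content, 'duration') or '')
        end
    end
    return '{| class="wikitable"\n! Title !! Album !! Duration\n'
        .. table.concat(rows, '\n') .. '\n|}'
end

return p

Each getContent call is relatively expensive, so a real module would still need limits, but the extension itself stays simple.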

The DPL cache would cache results across all DPL entry points and would have multiple levels. For example, if one wants to filter all articles in categories "Foo" and "Bar", DPL would cache the results for "Foo", cache the results for "Bar", and then filter and cache the results for "Foo" and "Bar" combined. That way, if one adds another category "Baz" to filter by, DPL could just look at the cached "Foo" and "Bar" results, plus the cache for "Baz", and combine them to build the cache for "Foo", "Bar", and "Baz".
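
A minimal sketch of that layering, assuming hypothetical cacheGet/cachePut/fetchCategoryMembers helpers (none of these exist in MediaWiki; the names are made up for illustration). Sorting the category names first gives a canonical cache key, so "Foo" + "Bar" and "Bar" + "Foo" share an entry:

local function intersect(a, b)
    -- Plain set intersection over lists of page names.
    local inB, out = {}, {}
    for _, page in ipairs(b) do inB[page] = true end
    for _, page in ipairs(a) do
        if inB[page] then out[#out + 1] = page end
    end
    return out
end

local function cachedIntersection(cats)
    table.sort(cats)
    local key = table.concat(cats, '&')
    local hit = cacheGet(key)                     -- hypothetical cache helper
    if hit then return hit end
    local result
    if #cats == 1 then
        result = fetchCategoryMembers(cats[1])    -- hypothetical base fetch
    else
        -- Build "Foo & Bar & Baz" from the cached "Foo & Bar" plus "Baz",
        -- exactly the reuse described above.
        local rest = { unpack(cats, 1, #cats - 1) }
        result = intersect(cachedIntersection(rest),
                           cachedIntersection({ cats[#cats] }))
    end
    cachePut(key, result)                         -- hypothetical cache helper
    return result
end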

So essentially: materialize every substep of the query, cache heavily, and reuse the substeps later. (As an aside, I think this is the strategy QLever uses.)

I do see some potential difficulties with this strategy (not saying these don't have potential solutions, just that they can be annoying to deal with):

  • potential exponential explosion of substeps. There are a lot of potential queries one could run, and you have to store all the substeps separately, which could be a lot of data. (Probably easiest to solve via LRU eviction; see the sketch below.)
  • difficulty of pushing optimizations downstream, e.g. if you only need 10 results, do you fetch only 10, or do you fetch all of them so your cache of intermediate results is more useful for other queries?
  • stale results: wikis are always changing, so how do you deal with cache invalidation? If you make the intermediate cache expire quickly, it isn't very useful. If you try to update all the caches as the wiki updates (like a materialized view), that can be computationally expensive if a lot of them need to be updated.

When you start heading down this path, you are halfway to making your own database engine, which can get really complicated. (Not trying to discourage, just be aware it's complex.)
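
On the first bullet, a toy sketch of LRU eviction for such a substep cache (the structure and capacity are invented for illustration; a production cache would sit behind something like MediaWiki's WANObjectCache instead):

-- Toy LRU: an ordered list of keys plus a key->value map; on overflow,
-- the least recently used entry is dropped.
local LRU = { order = {}, values = {}, capacity = 1000 }

function LRU:touch(key)
    -- Move the key to the end of the order list (most recently used).
    for i, k in ipairs(self.order) do
        if k == key then
            table.remove(self.order, i)
            break
        end
    end
    table.insert(self.order, key)
end

function LRU:get(key)
    local value = self.values[key]
    if value ~= nil then
        self:touch(key)
    end
    return value
end

function LRU:put(key, value)
    if self.values[key] == nil and #self.order >= self.capacity then
        local oldest = table.remove(self.order, 1)
        self.values[oldest] = nil  -- evict least recently used
    end
    self.values[key] = value
    self:touch(key)
end

The linear scan in touch keeps the sketch short; a real implementation would use a doubly linked list for O(1) updates.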

I like the full-featured functionality of DPL3; for example, the following query would fetch data from an infobox template and put it in a neatly formatted table: […]

I would suggest any future iterations only worry about generating the list of matching articles, and then let Lua do stuff like extracting contents. That would probably be more flexible and reduce implementation complexity.

See https://www.mediawiki.org/wiki/Extension:DynamicPageListEngine for a version of DPL with Lua support. That extension was last updated by its author in 2022.

I like the full-featured functionality of DPL3; for example, the following query would fetch data from an infobox template and put it in a neatly formatted table: […]

FYI for this use case my proposed solution is T332484: JSON-based page list.

The DPL cache would cache results across all DPL entry points and would have multiple levels. […]

So essentially: materialize every substep of the query, cache heavily, and reuse the substeps later. […] potential exponential explosion of substeps […]

Yeah that is just the nature of power sets.

There might be a smarter way to do caching. We should probably cache the intermediate results of extremely common queries rather than of every query. I think of sorted dynamic lists as a bit of a luxury: like the window washers of NYC, once all the queries are done, the system should go back to the start and run them again.

FYI, the successor to DynamicPageList3, DynamicPageList4, is now marked as stable (although it's currently incompatible with the latest stable release of MediaWiki). We should probably update this task's title and description to clarify that this is not about that extension.

Mr._Starfleet_Command renamed this task from "Develop a less resource intensive DPL4 for Wikimedia to replace DPL, DPL2, DPL3" to "Develop a less resource intensive DPL-like extension for Wikimedia to replace DPL (WMF), DPL3, and DPL4". · Jan 6 2026, 11:28 PM
Mr._Starfleet_Command updated the task description.

DPL4 is now compatible with MediaWiki 1.44+. According to its MediaWiki.org extension page:

[DPL4] is a continuation/rewrite of Extension:DynamicPageList3. It is a fully reworked code base, with significant code and database speed improvements, and is designed to be fully backward compatible with previous versions. It fixes numerous bugs, adds more features than previous versions, and makes the extension more maintainable and secure, by converting to MediaWiki's SelectQueryBuilder for example.

How close is DPL4 to something that could work on Wikimedia wikis?

I don't know. Fandom still uses DPL3 it seems.

I do think that we need some way to measure the expensiveness of DPL queries. I have had timeout issues with DPL3 before that could probably be better addressed if we figured out how to increment the expensive parser function count based on the size of the query, so that DPL queries are eventually skipped once the expensive parser function limit is reached. Maybe every filtering category increases the expensive parser function count by 1.
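
As a sketch of that accounting (mw.incrementExpensiveFunctionCount is a real Scribunto function; the chargedDPL wrapper, mw.ext.DPL, and the "|"-separated categories argument are assumptions for illustration):

-- Charge one expensive-function unit per filtering category before
-- running the (hypothetical) query, so big intersections hit the
-- expensive parser function limit instead of timing out.
local function chargedDPL(args)
    for _ in mw.text.gsplit(args.categories or '', '|', true) do
        mw.incrementExpensiveFunctionCount()
    end
    return mw.ext.DPL(args)  -- hypothetical
end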

Fandom still uses DPL3 it seems.

DPL4 is incompatible with MW 1.43, which is what Fandom wikis are currently running. Once Fandom does its next MediaWiki update, I imagine they'll switch to DPL4, since DPL3 is incompatible with MW 1.44+.

Maybe every filtering category increases the expensive parser function count by 1.

That sounds like a reasonable approach to me.

How close is DPL4 to something that could work on Wikimedia wikis?

See also T287380.

How close is DPL4 to something that could work on Wikimedia wikis?

It's probably very far away. The original DPL being allowed is largely a historical quirk and wouldn't be allowed today.

It's also a very complex code base, which reduces the chance it would be reviewed.