
Add ability to generate a list of pages based on prefix to Scribunto/Lua
Open, High, Public, Feature Request

Assigned To
None
Authored By
MZMcBride
Apr 11 2013, 8:41 PM
Tokens
"Love" token, awarded by lucamauri."Love" token, awarded by Lepticed7."Love" token, awarded by MarioGom."Meh!" token, awarded by Dvorapa."Love" token, awarded by Sebastian_Berlin-WMSE.

Description

Looking at https://www.mediawiki.org/wiki/Extension:Scribunto/Lua_reference_manual, I don't see a way currently to generate a list of pages based on prefix. For example, I wanted to write a module that would take each page listed at https://meta.wikimedia.org/wiki/Special:PrefixIndex/Global_message_delivery/Targets/ and generate output based on iterating over this generated list.

Rather than using a generated list, I was forced to specify each page title. This isn't great, as pages may be added or deleted and I don't want to update such a list by hand.

An equivalent to [[Special:PrefixIndex]] (or the MediaWiki API's list=allpages&apprefix=) inside Scribunto/Lua would be wonderful.
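
To make the current situation concrete, here is a minimal sketch (not the real target list) of the workaround described above: every page title must be hard-coded and maintained by hand, because Scribunto has no prefix-listing function today.

    -- Minimal sketch of the current workaround: every subpage title must be
    -- listed by hand. The titles below are placeholders, not the real list.
    local p = {}

    local targets = {
        'Global message delivery/Targets/Example one',
        'Global message delivery/Targets/Example two',
        -- ...must be updated by hand whenever pages are created or deleted
    }

    function p.listTargets()
        local out = {}
        for _, title in ipairs(targets) do
            table.insert(out, '* [[' .. title .. ']]')
        end
        return table.concat(out, '\n')
    end

    return p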


Version: unspecified
Severity: enhancement

Details

Reference
bz47137

Event Timeline

bzimport raised the priority of this task to Low. Nov 22 2014, 1:30 AM
bzimport set Reference to bz47137.
bzimport added a subscriber: Unknown Object (MLST).

darklama wrote:

I've provided an iterator solution, but it has the same limitations as
the prefixindex special page:

https://meta.wikimedia.org/wiki/Module:Subpages

I think an equivalent to the MediaWiki API's list=allpages&apprefix=
in iterator form inside Scribunto/Lua would be better though.
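
As a rough illustration of that iterator form, the sketch below assumes a hypothetical function mw.site.pagesByPrefix() that does not exist in the current Scribunto API; it only shows how such an iterator might be consumed from a module.

    -- Hypothetical sketch only: mw.site.pagesByPrefix() is NOT part of the
    -- current Scribunto API. It is assumed to return an iterator over title
    -- objects whose names start with the given prefix.
    local function listByPrefix(prefix)
        local out = {}
        for title in mw.site.pagesByPrefix(prefix) do  -- assumed iterator
            table.insert(out, title.fullText)
        end
        return out
    end

    -- Hypothetical usage:
    -- local targets = listByPrefix('Global message delivery/Targets/')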

Jackmcbarn raised the priority of this task from Low to High. Dec 2 2014, 4:26 AM
Jackmcbarn set Security to None.
Jackmcbarn added a subscriber: Jackmcbarn.

Bumped priority up now that the way users did this before (unstrip) doesn't work anymore.

Bumped priority up now that the way users did this before (unstrip) doesn't work anymore.

I really don't care at all what the priority of this (or any) task is.

That said, it feels a bit strange for the sudden absence of this functionality to be considered high priority. The previous implementation (using transclusion) was pretty clearly a giant fragile hack. I think everyone involved knew that this hack was almost certainly going to break at some point as Special page transclusion was never considered a stable programmatic interface.

Special page transclusion was never considered a stable programmatic interface.

It was the only viable option for achieving this goal with regard to maintenance work, performance and functionality. At Commons, we used it for listing language subpages (/af, /de, /nl, ...), so if it is easier to implement this specific functionality, I'd be happy with that. Perhaps I should open a ticket for this specific request?

And talking about stable interfaces, I can't count how often I have had to change my scripts that use API queries because something changed in incompatible ways. Sometimes I was under the impression that gadgets doing screen scraping needed to be updated less frequently.

In a multilingual module, I put translations of argument names, categories and error messages in the submodule "module_name/I18N". Then the main module can change without changing the translations in any language on any wiki. But without the module_name itself I cannot automate that for arbitrary modules.

The present change could resolve that, giving at the same time "module_name/I18N" and "module_name". The change could be helped by a parameter to select a subset of the subpage titles, those which contain "I18N" in my case.

I could also ask for a separate change, "Get the module_name itself". But the present change is more general and can be used for a group of submodules and their data.
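
To make this use case concrete, here is a rough sketch assuming a hypothetical getSubpages() listing function (nothing like it exists in Scribunto yet): the translation submodules are picked out by title and loaded safely with pcall.

    -- Sketch under an assumption: getSubpages() stands in for the requested
    -- (not yet existing) prefix/subpage listing function.
    local function loadI18N(moduleName, getSubpages)
        local translations = {}
        for _, subpage in ipairs(getSubpages(moduleName)) do
            -- Select only the translation submodules, e.g. "Module:Foo/I18N".
            if subpage:find('I18N', 1, true) then
                local ok, data = pcall(require, subpage)  -- do not fail if absent
                if ok then
                    table.insert(translations, data)
                end
            end
        end
        return translations
    end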

Perhaps each newly created subpage could record itself in a dedicated table in its "mother page". That could easily help to solve any page-tree question. For existing pages, a bot could build these tables once.

Perhaps each newly created subpage could record itself in a dedicated table in its "mother page".

That sounds like the old problem from T17071: Wikibooks/Wikisource needs means to associate separate pages with books.

Sorry, I was not explicit enough. My proposal was only about subpages, such as subpages of Module: or User: pages. As for pages of books on Wikisource, in the Page: namespace, I don't know whether the users of Wikisource are interested. Those pages are managed by the special extension Extension:Proofread_Page (https://www.mediawiki.org/wiki/Extension:Proofread_Page), which presents the text of a book page side by side with its image.

I'm wondering about how this feature would work with the current system of page protection, link tables and the expensive function count.

Every time this new prefixIndex function was used, we would have to have some way of tracking when a page with the prefix was created or deleted. When such a creation or deletion occurred, we would have to update all the transclusions of the page (probably a template) that used it, so presumably every page with the given prefix would have to count the template as a transclusion in the link table.

Now let's say this is a template with millions of transclusions. In this case, anyone creating or deleting a page that has the right prefix would trigger a re-rendering of all of these millions of pages. As things stand, there would be no kind of page protection preventing this, so the person doing the creating or deleting might not have any idea that their action was so expensive. It could also be used maliciously to put unnecessary strain on a site's servers. And while deletion is limited to admins, creation could potentially be done by anonymous users.

The previous workaround forced transclusions to update by simply disabling caching, but that's not an option, as it's even worse from a site-stability perspective. If we did that on a widely-transcluded template, it might actually bring the site down, as the pages would all have to be re-rendered on every page view.

Also, with this function, it would be possible to see whether a given page existed or not. If we treat this like the #ifexist parser function, then we would need to make it an expensive function. In fact, as you can check the existence of many pages at once, presumably we would need to make one prefixIndex call count as many expensive function calls. (As many as there are possible results that could be returned?)
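For comparison, a single existence check is already possible from Lua today, and (as far as the Scribunto manual describes it) reading .exists on a title other than the current page counts toward the expensive parser function limit, much like #ifexist; a prefix listing would in effect multiply such checks.

    -- Checking whether one page exists from Lua; reading .exists on a title
    -- other than the current page is counted as an expensive operation.
    local title = mw.title.new('Global message delivery/Targets/Example')
    if title and title.exists then
        -- the page exists; this read counted against the expensive limit
    end
    -- A prefixIndex function returning N titles (whose existence is implied)
    -- would, under the concern above, presumably need to count similarly.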

I'm as keen as anyone else to see this feature implemented, but we need to think about how to deal with these questions first.

I'm a bit confused about the concerns you have here.

  • We already allow transclusion of Special:PrefixIndex.
  • We have a unique index on (page_namespace, page_title), so listing by prefix is cheap.
  • Scribunto/Lua modules are already treated very similarly to templates and rely on the same cache invalidation infrastructure (links updates, etc.), as I understand it.

Long-term, it would be great if we could rely less on caching. The adoption of Scribunto modules over ParserFunctions templates, the deployment of HHVM, and other changes should get us closer to this goal eventually, I hope.

In my understanding, this function needs to work only when a page is created, renamed or deleted. Then a tree-table is updated for all pages up and down the tree, and each of these pages records the part of the table covering all of its subpages.
Later, a Scribunto function gives the module the tree-table of the page at no cost. Only access to the content of these pages is expensive.

  • We already allow transclusion of Special:PrefixIndex.

Although Special:PrefixIndex can be transcluded, its contents are stripped, meaning that modules can't parse it.

In my understanding, this function needs to work only when a page is created, renamed or deleted. Then a tree-table is updated for all pages up and down the tree, and each of these pages records the part of the table covering all of its subpages.
Later, a Scribunto function gives the module the tree-table of the page at no cost. Only access to the content of these pages is expensive.

I don't understand what you mean here. Can you clarify this comment?

  • We already allow transclusion of Special:PrefixIndex.

Although Special:PrefixIndex can be transcluded, its contents are stripped, meaning that modules can't parse it.

Right. I was speaking generally here. That is, users can transclude {{Special:PrefixIndex/Foo}} into wiki pages and subsequent page deletions and creations don't cause the servers to explode. In general, listing pages by prefix is pretty cheap, so I'm not sure there would be a huge problem with the performance of Scribunto/Lua modules if this functionality existed.

The difference is that in Lua someone can try to write a loop instead of having to be satisfied with getting just the 200 pages {{Special:PrefixIndex/Foo}} will give you.

In T49137#1463321, I tried to describe a cheaper-to-use implementation, but I am not sure about it because I'm not a system coder.

In my understanding, this function needs to work only when a page is created, renamed or deleted. Then a tree-table is updated for all pages up and down the tree, and each of these pages records the part of the table covering all of its subpages.
Later, a Scribunto function gives the module the tree-table of the page at no cost. Only access to the content of these pages is expensive.

In fact this is more complex than that, because the same saved page can generate dozens or hundreds of distinct variants (based on the current user name or the current language used), and they can all change based on the current time if there are uses of magic words like {{CURRENTMINUTE}}, which forces the server-side caches (not just the browser caches) to be given a shorter expiration time (meaning that these pages will be parsed and generated again). MediaWiki sets a minimum expiration time for all pages (to avoid resource attacks), but does not limit the number of languages.

Multiply all these variants by the number of *source* subpages to iterate, and such a page can generate a huge load on the server, with thousands or tens of thousands of pages being flushed from the cache: if a remote user then attempts to load all these pages (without even needing to load them completely and wait for their completion), the server load will suddenly explode (in terms of CPU; not much in terms of disk I/O, as the wiki source pages are all the same, but still a lot of I/O on the server's frontend cache).

But I agree: we could still allow a Scribunto parser to get a list of a limited number of subpages (e.g. 200) within a range (just like when transcluding a PrefixIndex). This would allow creating pages with navigation buttons to get the next or previous range, over which a script could loop.
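
A sketch of what that limited, range-based access might look like from a module, assuming a hypothetical prefixIndexRange() function taking a prefix, a continuation point and a cap of 200 results (none of which exists in Scribunto yet):

    -- Hypothetical sketch: prefixIndexRange() is assumed, not real. It would
    -- return up to 'limit' titles starting at 'continueFrom', plus the next
    -- continuation point (or nil when the listing is exhausted).
    local function collectAll(prefix)
        local all, continueFrom = {}, nil
        repeat
            local batch, nextFrom = prefixIndexRange(prefix, continueFrom, 200)
            for _, t in ipairs(batch) do
                table.insert(all, t)
            end
            continueFrom = nextFrom
        until continueFrom == nil
        return all
    end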

In my present application, the bindmodules() function tries to require("Module:Author/I18N"), through pcall so that it does not fail, for all modules and libraries and their alternative versions, like Module:MathRoman02 and their I18N submodules for i18n tables.

This already has an answer in Module:Central.
Another MediaWiki answer is useful only if it is not expensive.

I believe I understand that MediaWiki does not work like a classic PC file browser for subpages. Could such a structure be a MediaWiki answer?
Then a module could ask for the existing subpages of a given one, just one level below, then select some pages, then ask for another level... recursively, but under the control of the module.
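
A rough sketch of that level-by-level idea, assuming a hypothetical childPages() function that returns only the direct subpages (one level below) of a given title; the module itself decides whether to descend further.

    -- Hypothetical: childPages(title) returns only the direct subpages, one
    -- level below. Descent is controlled entirely by the calling module.
    local function walk(title, keep, depth)
        local chosen = {}
        for _, child in ipairs(childPages(title)) do
            if keep(child) then
                table.insert(chosen, child)
                if depth > 1 then
                    for _, deeper in ipairs(walk(child, keep, depth - 1)) do
                        table.insert(chosen, deeper)
                    end
                end
            end
        end
        return chosen
    end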

Note that as of today (Saturday 2020-10-18), any attempt to sort the native language names returned by fetchLanguageNames() (after copying these names into a new sequenced table using table.insert(t, ...)) fails when calling table.sort(t, compare): this is a new bug of Scribunto, which changed the interface of table so that it no longer accepts a standard comparison function as a parameter (one taking two native language names, which should be strings). Visibly, fetchLanguageNames() no longer returns standard strings; they are protected, including in their references, and do not match the expected signature for the comparison function (this causes an internal error inside sort(), where the comparison function is called with the wrong parameters, causing an invalid-type error).

A possible workaround I tried was to force the conversion of each returned native language name into a normal string (e.g. by concatenating an empty string). I tried it, but this did not work. What is failing is really table.sort(t, compare) in its version tweaked by Scribunto, which no longer supports the table.sort method with *any* compare(a, b) function parameter. So table.sort is now incorrectly bound in the PHP code of Scribunto; it should be able to call a standard Lua function but fails. Maybe the bug is in the format of the object representing the sequence of strings returned by fetchLanguageNames() (maybe this table is static and read-only, so it is no longer sortable, as that would change the numeric keys; it may also be related to the open bug T49104 "Provide a method to create a non read-only copy of a mw.loadData result").

For now the workaround is to use local ok, t2 = perror(function() table.sort(t, compare) end), so that it catches the error without sorting the table, but this can cause serious issues in various modules that depend on sorting.

This is not critical if the sort is just used for generating the UI or display, as elements will simply show in an unsorted order, but it may be critical for modules that depend on sorting to group together rows containing a column for the native language name, e.g. to create aggregates (sums, means...).

So something changed recently in Scribunto. I was pinged to solve this bug on Commons, where Module:Language/List could not process the internal list of languages returned by MediaWiki. I used perror to catch the failing sort, but then the returned table is not always fully sorted as it should be (though it's roughly OK and still consistent on Commons, where the language list is unified and preprocessed with this module, which also caches the result, because this is a simple but widely used function for various purposes where users expect to see a consistent ordering of languages everywhere to facilitate their navigation).
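
For reference, here is a minimal sketch of the pattern reported as failing above (copying the fetchLanguageNames() results into a plain sequence and sorting them with an ordinary comparator); pcall plays the role of the perror() helper mentioned above, and whether the sort actually fails depends on the Scribunto behaviour being reported here.

    -- The reported pattern: build a plain sequence from fetchLanguageNames()
    -- and sort it with an ordinary comparison function.
    local names = mw.language.fetchLanguageNames()
    local t = {}
    for code, name in pairs(names) do
        table.insert(t, name)
    end
    -- pcall here stands in for the perror() helper mentioned above.
    local ok, err = pcall(table.sort, t, function(a, b) return a < b end)
    -- 'ok' is false when the described bug is triggered; 't' is then left
    -- unsorted (or only partially sorted).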

Lepticed7 rescinded a token.
Lepticed7 awarded a token.
Lepticed7 added a subscriber: Lepticed7.
Tacsipacsi changed the subtype of this task from "Task" to "Feature Request". Feb 28 2021, 1:42 PM
Tacsipacsi added a subscriber: Tacsipacsi.

To limit the need for cache invalidations, maybe this function could be limited to subpages rather than arbitrary prefixes; for example, the equivalent of Special:PrefixIndex/Global message delivery/Targets/ would be something like mw.title.new( 'Global message delivery/Targets' ).subpages. This way, when Global message delivery/Targets/bar/foo is created or deleted, only the subpage list queries for Global message delivery/Targets/bar, Global message delivery/Targets and Global message delivery would need to be invalidated, whereas with special pages even a partial prefix like Special:PrefixIndex/Global message delivery/Targets/bar/fo also lists it.

Maybe even a (probably optional) depth limit could also be introduced, so that if one only wants to get translations (the main use case on Commons), the translations' subpages aren't listed (and, more importantly, the query doesn't need to be invalidated if such a subpage is created/deleted). For example, mw.title.new( 'Global message delivery/Targets' ).subpages( 1 ) would return Global message delivery/Targets/bar, but not its subpage Global message delivery/Targets/bar/foo.
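
Under that proposal (none of which exists yet), usage from a module might look roughly like this; both forms of .subpages below are exactly the hypothetical interface sketched in the two comments above, not part of the current mw.title API.

    -- Hypothetical usage of the proposed interface; neither form exists in
    -- the current mw.title API.
    local base = mw.title.new('Global message delivery/Targets')
    -- All subpages, any depth (proposed property):
    --   local all = base.subpages
    -- Direct subpages only, using the optional depth limit (proposed call):
    --   local direct = base.subpages(1)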