
'-titleregex' does not handle namespaces other than 0
Closed, Resolved, Public

Description

When -titleregex is called with a namespace other than 0 (with GeneratorFactory), no pages are returned. Only pages with namespace 0 are ever returned.

In GeneratorFactory, -titleregex calls RegexFilterPageGenerator (which is really RegexFilter.titlefilter) with site.allpages() as its argument, and allpages defaults to namespace 0. When getCombinedGenerator() is called, the generator (already populated with namespace-0 pages only) is not considered a pywikibot.data.api.QueryGenerator, so it is filtered with NamespaceFilterPageGenerator for the requested namespaces. The result is either empty or contains only namespace-0 pages.
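The failure mode described above can be reproduced in a few lines without pywikibot. This is only a sketch: Page, WIKI, allpages, title_filter, namespace_filter, buggy_titleregex and namespace_aware_titleregex are hypothetical stand-ins written for illustration, not pywikibot's actual classes or functions, and the namespace-aware variant shows one possible approach rather than the fix that was merged.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a wiki page; not pywikibot's Page class.
Page = namedtuple("Page", ["namespace", "title"])

# A tiny fake wiki with pages in namespace 0 and namespace 4.
WIKI = [
    Page(0, "Python"),
    Page(0, "Pywikibot"),
    Page(4, "Project:Python help"),
    Page(4, "Project:About"),
]


def allpages(namespace=0):
    """Yield pages from a single namespace; the default is namespace 0."""
    return (page for page in WIKI if page.namespace == namespace)


def title_filter(pages, pattern):
    """Keep only pages whose title matches the regex."""
    regex = re.compile(pattern)
    return (page for page in pages if regex.search(page.title))


def namespace_filter(pages, namespaces):
    """Keep only pages that belong to one of the requested namespaces."""
    return (page for page in pages if page.namespace in namespaces)


def buggy_titleregex(pattern, namespaces):
    # The source generator is built with the default namespace, so it
    # only ever contains namespace-0 pages. Filtering by namespace
    # afterwards can only discard pages, never add the missing ones.
    source = title_filter(allpages(), pattern)
    return list(namespace_filter(source, namespaces))


def namespace_aware_titleregex(pattern, namespaces):
    # One possible approach (an assumption, not the merged patch):
    # query each requested namespace directly, then apply the regex.
    results = []
    for ns in namespaces:
        results.extend(title_filter(allpages(namespace=ns), pattern))
    return results
```

With this fake wiki, buggy_titleregex("Python", [4]) returns an empty list, while namespace_aware_titleregex("Python", [4]) returns the namespace-4 page whose title matches.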

Event Timeline

Daviskr raised the priority of this task to Medium.
Daviskr updated the task description.
Daviskr subscribed.

Another side effect of the current setup is that it fetches all pages (that match the regex) before it applies limit*. This causes extreme slowdown as seen in this build.

*: As Mpaa said below, this only applies to namespaces other than zero.

One clarification: this happens only if a namespace other than 0 is specified, as is done in the test. For the reason explained above, no pages are yielded at all, so the test ends only once allpages() has fetched every page in the specified namespace.
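The slowdown can be illustrated with a small simulation. This is a sketch under stated assumptions: run, its parameters, and the inner allpages are hypothetical names, the source holds 1000 namespace-0 pages, and the limit is modelled with itertools.islice rather than pywikibot's own limit handling.

```python
from itertools import islice


def run(wanted_namespace, limit, total=1000):
    """Return (results, number of pages fetched from the source)."""
    fetched = []

    def allpages():
        # Simulated listing: every page lives in namespace 0.
        for i in range(total):
            page = (0, "Page %d" % i)
            fetched.append(page)
            yield page

    # The namespace filter runs after the source generator was created.
    filtered = (page for page in allpages() if page[0] == wanted_namespace)

    # The limit applies to what survives the filter, so it can only stop
    # the source early if enough pages actually pass through.
    results = list(islice(filtered, limit))
    return results, len(fetched)
```

Calling run(0, 5) stops after five pages have been fetched, because the limit is satisfied immediately. Calling run(4, 5) yields nothing, so the limit is never reached and all 1000 simulated pages are fetched before the generator is exhausted.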

This comment was removed by Mpaa.

Change 181993 had a related patch set uploaded (by Mpaa):
Pagegenerators.py: ns handling for titleregex option

https://gerrit.wikimedia.org/r/181993

Patch-For-Review

After T57226 is merged, we can re-purpose this task to track the underlying problem; I suspect we want to wait until argparse has landed before fixing the real problem, as it should be much simpler then.

See also T114015, where it is suggested to change how titleregex works.