
Pywikibot; Listpages.py ; -ns: option apparently ignored if -limit: option used...
Open, High | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Open a Bash notebook.
  • Enter the command:
pwb.py listpages -usercontribs:"ShakespeareFan00" -intersect  -limit:500 -ns:104 -lang:en -family:wikisource -format:"* [[{page.loc_title}]]"
  • Run the command

What happens?:

The query runs, but the output contains entries that are clearly not in the namespace explicitly given (ns 104 is the Page: namespace on English Wikisource).

What should have happened instead?:

The generated results should have had only entries in the Page: namespace.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:

Event Timeline

ShakespeareFan00 renamed this task from "-ns: option apparently ignored if -limit: option used..." to "Pywikibot; Listpages.py ; -ns: option apparently ignored if -limit: option used...". May 10 2022, 2:39 PM

Found these wrong entries:

C:\pwb\GIT\core>pwb.py listpages -usercontribs:"ShakespeareFan00" -intersect  -limit:500 -ns:104 -lang:en -family:wikisource -format:"* [[{page.loc_title}]]"
WARNING: "-intersect" ignored as only one generator is specified.
* [[Index:Iran Air Flight 655 investigation.djvu/styles.css]]
* [[Author:Laura Fry Kready]]
* [[Catholic Encyclopedia (1913)/Congregation of Priests of the Mission]]
* [[Index:As others saw Him.djvu]]
* [[User talk:ShakespeareFan00]]
* [[Category:Page containing non visible comments or annotations]]
* [[User talk:DivermanAU]]
* [[Template:Remarks/doc]]
* [[Template:Remarks]]
* [[Index:Keil and Delitzsch,Biblical commentary the old testament the pentateuch, trad James Martin, volume 1, 1885.djvu]]
* [[Template:*!/c]]
* [[Template:*!/s]]
365 page(s) found

C:\pwb\GIT\core>

There are two problems:

  • API:Usercontribs supports a namespace parameter, but either we have no api.QueryGenerator here or support_namespace() returns False for the generator item
  • NamespaceFilterPageGenerator definitely yields pages from the wrong namespace here
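
For illustration, the behaviour one would expect from a namespace filter can be sketched as below. This is a minimal stand-alone sketch, not Pywikibot's NamespaceFilterPageGenerator: the `Page` tuple, the `filter_by_namespace` function, the namespace numbers other than 104 and the sample Page: title are all assumptions made up for the example.

```python
from typing import Iterable, Iterator, NamedTuple, Set


class Page(NamedTuple):
    """Hypothetical stand-in for a wiki page: namespace number and title."""
    ns: int
    title: str


def filter_by_namespace(pages: Iterable[Page],
                        namespaces: Set[int]) -> Iterator[Page]:
    """Lazily yield only the pages whose namespace is in `namespaces`."""
    for page in pages:
        if page.ns in namespaces:
            yield page


# Sample data; namespace numbers other than 104 are illustrative only.
contribs = [
    Page(102, "Author:Laura Fry Kready"),
    Page(104, "Page:As others saw Him.djvu/7"),  # hypothetical Page: entry
    Page(3, "User talk:ShakespeareFan00"),
]
page_ns_only = list(filter_by_namespace(contribs, {104}))
```

With a filter like this, only the Page: entry would survive, which is what the report expects from -ns:104.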

Also, I note that the query reports 365 entries although the limit is 500; I certainly have more than 500 user contributions on English Wikisource.

@ShakespeareFan00: As far as I can see, this is a known problem with the pagegenerators generator functions; I guess there must be a task for it already. As a workaround, you have to define the namespace first:

pwb.py listpages -ns:104 -usercontribs:"ShakespeareFan00" -limit:500 -site:wikisource:en -format:"* [[{page.loc_title}]]"

Is this T222519? I am trying your suggested re-ordering...

And your suggested change didn't remove the non-Page: namespace entries...

I strongly suggest some kind of option sorting is done BEFORE anything is generated.

Xqt triaged this task as High priority. May 10 2022, 3:19 PM


Ah, no: It does not work in the current implementation. I tried it with a patch.

Suggestions:

Sort the options into a logical order before further processing... I'm not sure whether Python will let you do that.

Aside:

It's not clear from the documentation whether the same filter option can be specified more than once. I am assuming it can't (although I have had two invocations of -linter work for me). I am thinking of filter options like -grep:, where you might want to filter on two or more things with two simple regexps rather than one really complex one.

@ShakespeareFan00: Here is code that will work if you give the -ns parameter before the generator parameter. All you have to do is change this method in pagegenerators:

def _handle_usercontribs(self, value: str) -> HANDLER_RETURN_TYPE:
    """Handle `-usercontribs` argument."""
    self._single_gen_filter_unique = True
    return UserContributionsGenerator(
        value, site=self.site, _filter_unique=None,
        namespaces=self.namespaces)

I am not sure whether this will be kept as the final solution or carried over to Pywikibot 7.3.
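
Why the position of -ns matters with a change like this can be shown with a toy model. The sketch below is not Pywikibot code: `ToyFactory` and `handle_arg` are hypothetical names, and it only assumes that a generator is created at the moment its option is handled and therefore sees just the namespaces known at that point.

```python
from typing import List, Optional, Tuple


class ToyFactory:
    """Hypothetical, much simplified stand-in for an argument handler."""

    def __init__(self) -> None:
        self.namespaces: Optional[List[int]] = None
        # Each entry records (user name, namespaces seen at creation time).
        self.gens: List[Tuple[str, Optional[List[int]]]] = []

    def handle_arg(self, arg: str) -> None:
        name, _, value = arg.partition(":")
        if name == "-ns":
            self.namespaces = [int(value)]
        elif name == "-usercontribs":
            # The generator is created immediately, so it only sees the
            # namespaces that have been handled up to this point.
            self.gens.append((value, self.namespaces))


ns_first = ToyFactory()
for arg in ["-ns:104", "-usercontribs:Example"]:
    ns_first.handle_arg(arg)

ns_last = ToyFactory()
for arg in ["-usercontribs:Example", "-ns:104"]:
    ns_last.handle_arg(arg)
```

In this toy model `ns_first` creates its generator with namespace 104, while `ns_last` creates it with no namespace at all, so any filtering would have to happen afterwards.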

> I strongly suggest some kind of option sorting is done BEFORE anything is generated.

API:Usercontribs does not support any sorting; the sort order seems to be the edit timestamp. Sorting on the client side is problematic because all entries have to be retrieved from the API before any page can be yielded. This can exhaust memory, which is what generators usually prevent.
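
The memory concern can be illustrated without any wiki access. In this sketch `contributions` is a hypothetical stand-in for a paged API query, and the `fetched` list exists only to count how many entries were actually retrieved.

```python
from itertools import islice
from typing import Iterator, List

fetched: List[int] = []  # records which entries were actually retrieved


def contributions() -> Iterator[int]:
    """Hypothetical stand-in for a paged API query with 10,000 entries."""
    for i in range(10_000):
        fetched.append(i)
        yield i


# Lazy consumption: only the first five entries are ever retrieved.
first_five = list(islice(contributions(), 5))
lazy_fetch_count = len(fetched)

# Client-side sorting: sorted() must exhaust the generator first.
fetched.clear()
first_five_sorted = sorted(contributions())[:5]
sorted_fetch_count = len(fetched)
```

The lazy version touches five entries, whereas the sorted version has to pull all 10,000 before it can return anything.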

> Suggestions:
>
> Sort the options into a logical order before further processing... I'm not sure whether Python will let you do that.

That looks like a good idea. It might be possible with pg.handle_args, which was introduced in Pywikibot 6.0.

> Aside:
>
> It's not clear from the documentation whether the same filter option can be specified more than once. I am assuming it can't (although I have had two invocations of -linter work for me). I am thinking of filter options like -grep:, where you might want to filter on two or more things with two simple regexps rather than one really complex one.

OK, that should be made clearer. All generator options as well as filter options can be given several times.

Would it be possible to have a page generator that was effectively -inns:104:A:Z? I could then use -intersect to filter it down to manageable levels.

> I strongly suggest some kind of option sorting is done BEFORE anything is generated.
>
> API:Usercontribs does not support any sorting; the sort order seems to be the edit timestamp. Sorting on the client side is problematic because all entries have to be retrieved from the API before any page can be yielded. This can exhaust memory, which is what generators usually prevent.

That's not what I meant.

I meant option sorting in the sense of re-ordering the command line options internally so that they are processed in an order that makes sense for the options selected. Then I don't have to recall a specific ordering when writing a command line in PAWS, for example.
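
Re-ordering of this kind could be sketched roughly as follows. This is only an illustration: `reorder_args` and `EARLY_OPTIONS` are hypothetical names, and the choice of which options count as "early" is an assumption, not something Pywikibot defines.

```python
from typing import List, Tuple

# Assumption for this sketch: these option names must be handled before
# any generator option. The tuple is illustrative, not Pywikibot's list.
EARLY_OPTIONS: Tuple[str, ...] = ("-ns", "-namespace", "-namespaces")


def reorder_args(args: List[str]) -> List[str]:
    """Move early options to the front, keeping relative order otherwise."""
    def is_early(arg: str) -> bool:
        return arg.partition(":")[0] in EARLY_OPTIONS

    # sorted() is stable, so options within each group keep their order.
    return sorted(args, key=lambda arg: not is_early(arg))


args = ["-usercontribs:ShakespeareFan00", "-limit:500", "-ns:104"]
reordered = reorder_args(args)
```

Because the sort is stable, everything other than the namespace option keeps the order in which the user typed it.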

> Would it be possible to have a page generator that was effectively -inns:104:A:Z? I could then use -intersect to filter it down to manageable levels.

I cannot follow what you mean here. You can specify multiple generators and use the -intersect option, which yields only the intersection of all of them.
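
As a rough picture of what intersecting several generators means, here is a simplified stand-alone sketch. It is not how Pywikibot implements -intersect; the `intersect` function and the sample titles are hypothetical, and all iterables except the first are read completely into memory.

```python
from typing import Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T", bound=Hashable)


def intersect(first: Iterable[T], *others: Iterable[T]) -> Iterator[T]:
    """Yield items of `first` that also occur in every other iterable."""
    # The other iterables are materialised as sets; only `first` stays lazy.
    other_sets = [set(other) for other in others]
    seen = set()
    for item in first:
        if item not in seen and all(item in s for s in other_sets):
            seen.add(item)
            yield item


# Hypothetical sample titles.
user_contribs = ["Page:A/1", "Author:X", "Page:A/2", "Page:A/1"]
pages_in_ns_104 = ["Page:A/1", "Page:A/2", "Page:B/1"]
common = list(intersect(user_contribs, pages_in_ns_104))
```

Only titles present in both inputs are yielded, each one once, in the order of the first input.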


Ah, you mean your suggestion "Sort the options into a logical order before further processing". Yes, good idea; see above.


There is a -start generator, but from my reading of the documentation it is not possible to tell it to look only in a certain namespace.
Nor does there seem to be an option to tell it to stop when it reaches a certain point.

Take, for example, wanting to list all pages starting with a specific prefix. You can say

-start:And_another_thing

but as far as I can see you can't tell it to stop once you reach
And_now_for_something_completely_different
This is different from limiting to just the next 500 entries. Is there a need for an -end option?
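
What such a start/end range would do can be sketched with the standard library. This is only an illustration of the idea, not an existing Pywikibot option: `title_range` is a hypothetical helper, the titles are made up, the input is assumed to be sorted alphabetically, and the end title is assumed to be excluded.

```python
from itertools import dropwhile, takewhile
from typing import Iterable, Iterator


def title_range(titles: Iterable[str], start: str, end: str) -> Iterator[str]:
    """Yield titles t with start <= t < end from a sorted iterable."""
    from_start = dropwhile(lambda t: t < start, titles)
    # takewhile stops at the first title >= end, so nothing after it is read.
    return takewhile(lambda t: t < end, from_start)


# Hypothetical, alphabetically sorted sample titles.
titles = [
    "And_all_that",
    "And_another_thing",
    "And_now",
    "And_now_for_something_completely_different",
    "Another_page",
]
selected = list(title_range(
    titles, "And_another_thing", "And_now_for_something_completely_different"))
```

Unlike a fixed -limit, the number of results here depends on where the end title falls, not on a count chosen in advance.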

From the doc:

-start            You can also include a namespace. For example,
                    "-start:Template:!" will make the bot work on all pages
                    in the template namespace.

But indeed there is no end/stop option now. Please file a new task for it.


See T308112