scripts/category.py: Allow pagegenerators intersection with actions other than "add"
Closed, Resolved | Public | Feature

Description

I want to remove a category from the pages in the intersection of two categories, using the -intersect argument (to remove pages that are duplicated in a supercategory).

If I run the listpages script with -intersect, everything is correct: there are only 12 pages.
pwb.py listpages -family:wikisource -lang:ru -cat:"ОУН-УПА" -cat:"Украина" -intersect

But when I run the same with the category script, it starts emptying the entire global category "Украина", although it should only touch those 12 pages.
pwb.py category -family:wikisource -lang:ru -cat:"ОУН-УПА" -cat:"Украина" -intersect remove -from:"Украина"

Note: The problem is more serious because a bug was also found with the -always parameter (T318236), so the bot empties the global category without asking.

-file argument doesn't work either

Also, the -file argument does not work for removing the category from the pages listed in a file.

> pwb.py category -family:wikisource -lang:ru -file:/tmp/files.txt remove -from:"Украина"
ERROR: Unknown parameter "-file:/tmp/files.txt".
Use -help for further information.
Retrieving 27 pages from wikisource:ru.
Page [[Гимн Украины]] saved
...

It can be seen that instead of only 12 intersection pages from the list in the file, the global category is emptied.

-cat argument doesn't work either

The page generator doesn't seem to work there at all. Even with one category.

> pwb.py category -family:wikisource -lang:ru -cat:"ОУН-УПА" remove -from:"Украина"              
ERROR: Unknown parameter "-cat:ОУН-УПА".
Use -help for further information.
Retrieving 27 pages from wikisource:ru.
Page [[Гимн Украины]] saved
Page [[Грамота ко всему украинскому народу (Скоропадский)]] saved
...

Event Timeline

Neither -cat nor -file is a valid option for the category script with the remove action:

D:\pwb\GIT\core>pwb.py -simulate category remove -site:wikisource:ru -cat:"ОУН-УПА" -cat:"Украина" -intersect -from:"Украина"
ERROR: Unknown parameters "-cat:ОУН-УПА", "-cat:Украина", "-intersect".
Use -help for further information.

pagegenerators options are currently accepted only for the 'add', 'listify' and 'tidy' actions. So this seems to be a feature request?

All of these scripts are based on pywikibot.pagegenerators and pywikibot.bot, and use common generator arguments.
If this script is somehow different, could you remove the list of generator commands from the script's help (pwb.py category -help) and from its page https://www.mediawiki.org/wiki/Manual:Pywikibot/category.py#Generators_and_filters_available ?

Xqt triaged this task as Low priority. Sep 21 2022, 4:19 PM
Xqt changed the subtype of this task from "Task" to "Feature Request".

I see, the documentation is misleading. I haven't verified whether the manual pages on MediaWiki are valid; surely some of them are outdated. The reference for the whole framework documentation is at https://doc.wikimedia.org/pywikibot/stable/ and the category script description can be found here, for example. See also T312992.

Calling pwb category -help shows the message "If action is "add", the following additional options are supported:" and then displays the pagegenerators options. That means these options are only available for the add action; tidy and listify enable only the namespace filter.

Extending this to actions other than add is not trivial because all the others already need a category to work on, which means there must then always be an intersection with pagegenerators. Anyway, your suggestion is very useful and I think it should be implemented.

Xqt renamed this task from scripts/category.py: Problem with the page generator to scripts/category.py: Allow pagegenerators intersection with actions other than "add". Sep 21 2022, 4:20 PM

Hm. The category.py help at https://doc.wikimedia.org/pywikibot/stable/scripts/main.html#module-scripts.category also says "This script supports use of pywikibot.pagegenerators arguments."

There is also confusion about the name of the argument.
For each page supplied by the page generator, the script executes the remove -from:category_name command. In human terms, this means: remove this page from the category.
But it turns out that it clears the category completely, removing all pages with a single command, with no restrictions and no way to filter the process. In that case the command should be called "clear category".
But the script already has a clean command, which has no description. What does it do? Is it a duplicate of remove?

How can pages be removed from a category by generator? Is the only way to use the replace.py script with the -regex argument?
pwb.py replace -family:wikisource -lang:ru -cat:"ОУН-УПА" -regex '\[\[Категория:Украина\]\]\n?' ''
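
For reference, the effect of that regex on a page's wikitext can be sketched in plain Python. This only applies the pattern from the command above with re.sub; the sample wikitext and the helper name strip_category are made up for the example, and it does not claim to show how replace.py works internally.

```python
import re

# Pattern copied from the replace.py command above: the category link
# followed by an optional newline.
PATTERN = re.compile(r'\[\[Категория:Украина\]\]\n?')


def strip_category(wikitext: str) -> str:
    """Return wikitext with every match of PATTERN removed (hypothetical helper)."""
    return PATTERN.sub('', wikitext)


# Made-up sample page text for illustration.
sample = 'Текст страницы\n[[Категория:Украина]]\n[[Категория:ОУН-УПА]]\n'
cleaned = strip_category(sample)
```

Note that this pattern leaves links with a sort key, such as [[Категория:Украина|ключ]], untouched, which is one reason a regex replacement is only a workaround.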

Change 833995 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [doc] Update category.py documentation

https://gerrit.wikimedia.org/r/833995

Change 833995 merged by Xqt:

[pywikibot/core@master] [doc] Update category.py documentation

https://gerrit.wikimedia.org/r/833995

Xqt changed the task status from Open to In Progress. Sep 23 2022, 8:13 AM
Xqt claimed this task.

Change 834501 had a related patch set uploaded (by Xqt; author: Xqt):

[pywikibot/core@master] [IMPR] Enable pagegenerators options with 'move' and 'remove' actions

https://gerrit.wikimedia.org/r/834501

@Vladis13: Are you able to check the new implementation with a command like
pwb.py -simulate category -site:wikisource:ru -cat:"ОУН-УПА" remove -from:"Украина"

Now the script says "intersection" even when I use only one category:

pwb.py category -site:wikisource:ru -cat:"Дистрикт Галиция" remove -from:"Дивизия СС «Галичина»"  -simulate
Retrieving intersection of generators.
Retrieving 4 pages from wikisource:ru.

It seems to me that it is not necessary to do a local replacement of pagegenerators. That is, the script must: a) pass all parameters to pagegenerators (this may include -intersect, -file or not), b) get a list of pages, c) work on them.
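
Step a) of that proposal could look roughly like the following. This is a simplified, self-contained sketch: GENERATOR_OPTIONS and split_arguments are hypothetical names for illustration and are not part of Pywikibot, where the generator layer itself decides which arguments it recognizes.

```python
# Hypothetical set of option names owned by the generator layer.
# The real set is defined by pywikibot.pagegenerators, not here.
GENERATOR_OPTIONS = {'-cat', '-file', '-intersect'}


def split_arguments(args):
    """Split command-line args into (generator_args, script_args)."""
    generator_args, script_args = [], []
    for arg in args:
        # '-cat:Foo' -> '-cat'; arguments without a value stay as they are.
        name = arg.split(':', 1)[0]
        if name in GENERATOR_OPTIONS:
            generator_args.append(arg)
        else:
            script_args.append(arg)
    return generator_args, script_args


gen_args, script_args = split_arguments(
    ['-cat:ОУН-УПА', '-cat:Украина', '-intersect', 'remove', '-from:Украина'])
```

With this split, the action and its -from option stay with the script, while everything else is handed to the generator layer to produce the page list for steps b) and c).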

In the commit (and in the title of this thread), you use the word "intersection" a lot. But the point is to support the basic pagegenerators options, which include "intersection" as well. It seems to me that you can simplify the descriptions in the commit.

Now the script says "intersection" even when I use only one category

This is intentional. The script always retrieves the intersection of the pagegenerators' generators and the articles of the category given with the -from option. The reason is that the pagegenerators' generators may be either

  • the category's articles themselves. The CachedRequest prevents duplicate loading of pages, so there is no disadvantage.
  • a subset of the category's articles. The intersect generator halts once all subset articles are processed. In the worst case all of the category's articles have to be processed, if the subset elements are found at the end of the category's articles.
  • a superset of the category's articles. The intersect generator halts once all articles are processed. In the worst case all superset articles have to be processed, if the category's articles are found at the end of the superset.
  • a distinct set of articles. In this case both generators of the intersection have to be processed completely.

But we do not preload the pages during intersection, and the generators are bulk-loaded with up to 5000 items.
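
The halting behaviour described in the cases above can be illustrated with a small stand-alone generator. This is a minimal sketch of the idea for two iterables of hashable items; lazy_intersection is a made-up name, and it is not the intersect generator that the script actually uses.

```python
from itertools import zip_longest

_MISSING = object()  # fill value marking an exhausted iterable


def lazy_intersection(first, second):
    """Yield each item found in both iterables, reading them in lockstep."""
    seen_first, seen_second, yielded = set(), set(), set()
    done_first = done_second = False
    for a, b in zip_longest(first, second, fillvalue=_MISSING):
        if a is _MISSING:
            done_first = True
        else:
            seen_first.add(a)
        if b is _MISSING:
            done_second = True
        else:
            seen_second.add(b)
        for item in (a, b):
            if item is _MISSING or item in yielded:
                continue
            if item in seen_first and item in seen_second:
                yielded.add(item)
                yield item
        # Stop early once an exhausted iterable has had all its items matched.
        if (done_first and seen_first <= yielded) or (
                done_second and seen_second <= yielded):
            return


subset = ['A', 'B']                          # e.g. pages from a generator
category = iter(['X', 'B', 'A', 'Y', 'Z'])   # e.g. category members
common = list(lazy_intersection(subset, category))
leftover = list(category)                    # items that were never read
```

In the subset case shown here the larger iterable is read only until every subset item has been matched, so the trailing items are never fetched.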

The reason for using this intersection is that it has to be verified anyway whether the pages from pagegenerators are members of the given category. The only other way would be to use textlib.getCategoryLinks to get the category links for each page given by the pagegenerators. But preloading the content is necessary in that case, and I am not sure whether this would be faster; it also uses a lot of memory.
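
To make the alternative concrete, checking membership from the page text could look roughly like this. It is a crude stand-in for illustration only: the regex and the names categories_in and members_by_text are assumptions made for this sketch, the namespace names are hard-coded, and it does not reproduce what textlib.getCategoryLinks does.

```python
import re

# Simplified category-link matcher with hard-coded namespace names.
# Group 1 is the category title; an optional sort key is ignored.
CATEGORY_LINK = re.compile(
    r'\[\[\s*(?:Category|Категория)\s*:\s*([^\]|]+)(?:\|[^\]]*)?\]\]')


def categories_in(wikitext):
    """Return the set of category titles linked in wikitext."""
    return {title.strip() for title in CATEGORY_LINK.findall(wikitext)}


def members_by_text(pages, category):
    """Return titles of pages whose text links to category.

    pages maps page title to wikitext, so every page's content must
    already be loaded, which is the cost mentioned above.
    """
    return [title for title, text in pages.items()
            if category in categories_in(text)]


# Made-up page texts for illustration.
pages = {
    'A': 'текст [[Категория:Украина]]',
    'B': '[[Категория:ОУН-УПА|ключ]]',
    'C': '[[Категория: Украина |x]] [[Категория:ОУН-УПА]]',
}
in_category = members_by_text(pages, 'Украина')
```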

I made a worst case check with 5376 pages from pagegenerators and 94 pages for the -from category:

Implementation                                        Time used
current implementation                                12-13 s
same as above but duplicates are allowed              11-12 s
textlib.getCategoryLinks to check category members    76-97 s
same as above but preload=True for pagegenerators     68-74 s

Conclusion: the current implementation is roughly six to eight times faster than using textlib, and I haven't found any faster way. (I also haven't used expanded templates for parsing wikitext.)

Change 834501 merged by jenkins-bot:

[pywikibot/core@master] [IMPR] Enable pagegenerators options with 'move' and 'remove' actions

https://gerrit.wikimedia.org/r/834501

Xqt removed a project: Patch-For-Review.