
Preloading the categories of a set of pages
Closed, Resolved · Public · Feature

Description

I had a list of pages in the English Wiktionary and wanted to keep only those belonging to a certain category. My script preloaded the pages with APISite.preloadpages and then iterated over their categories with BasePage.categories. This is slow, because BasePage.categories sends a separate request for a page's category list each time it is called. Ideally the categories would be preloaded along with the rest of the page properties, so that the script sends only one request per group of pages.
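For illustration, the pattern described above looks roughly like the sketch below. The name filter_pages_slow and its parameters are hypothetical; site is assumed to behave like a Pywikibot APISite, and only preloadpages, categories and title are called on the objects involved.

```python
def filter_pages_slow(site, pages, wanted_titles):
    """Yield pages that belong to at least one of the wanted categories.

    Hypothetical helper: ``site`` is assumed to behave like a Pywikibot
    APISite, and ``wanted_titles`` holds full category titles such as
    "Category:English nouns".
    """
    wanted = set(wanted_titles)
    for page in site.preloadpages(pages):
        # The page content arrives in batches, but categories() still
        # queries the API separately for every single page.
        if any(cat.title() in wanted for cat in page.categories()):
            yield page
```

With N pages this costs roughly N extra requests on top of the batched preload, which is where the time goes.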

I searched through the Pywikibot docs and didn't find any way to do this. I also asked @valhallasw on IRC and he confirmed that there's no way to do this currently.

A good solution would be to add a parameter to APISite.preloadpages that would tell it to preload the categories using API:Categories and make them accessible to BasePage.categories on the pages yielded by the iterator. I'm not familiar enough with the internals of Pywikibot to fully work this out though.
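As a rough sketch of what such batching could look like at the MediaWiki API level (not how Pywikibot implements it), the helpers below build one API:Categories query for a whole group of titles and read the category titles back out of the decoded JSON. The function names are hypothetical, and the response layout assumes formatversion=2, where query.pages is a list.

```python
def build_categories_query(titles):
    """Return parameters for one API:Categories request covering many pages."""
    return {
        "action": "query",
        "prop": "categories",
        "titles": "|".join(titles),  # several titles share one request
        "cllimit": "max",
        "format": "json",
        "formatversion": "2",
    }


def collect_category_titles(responses):
    """Merge decoded JSON responses into {page title: [category titles]}.

    ``responses`` may hold several responses for the same group of pages,
    for example the first one plus its continuation requests.
    """
    merged = {}
    for response in responses:
        for page in response.get("query", {}).get("pages", []):
            merged.setdefault(page["title"], []).extend(
                entry["title"] for entry in page.get("categories", [])
            )
    return merged
```

The category list of a group can be split across continuation responses, which is why the second helper accepts several responses and merges them per page.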

@valhallasw on IRC kindly showed me some modifications to site.py that make APISite.preloadpages preload the categories and place the decoded JSON under page._preloaded["categories"] in the page objects, but that's not a long-term solution.
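With that workaround in place, reading the titles back could look like the sketch below. This is only an illustration: _preloaded is the ad hoc attribute from the IRC modification, not part of Pywikibot, and the assumption that it holds the usual API:Categories entries (dicts with a "title" key) is mine. The function name is hypothetical.

```python
def preloaded_category_titles(page):
    """Return category titles stored on ``page`` by the site.py workaround.

    Assumes ``page._preloaded["categories"]`` is a list of dicts shaped
    like API:Categories entries, e.g. {"ns": 14, "title": "Category:X"}.
    Returns an empty list when nothing was preloaded.
    """
    preloaded = getattr(page, "_preloaded", None) or {}
    return [entry["title"] for entry in preloaded.get("categories", [])]
```

No request is sent here; the function only reads what the preload step already stored on the page object.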

Event Timeline

Xqt triaged this task as Low priority. Jan 2 2020, 6:33 PM
Xqt changed the subtype of this task from "Task" to "Feature Request".

Change 561697 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [FEAT] Add ability to preload categories

https://gerrit.wikimedia.org/r/561697

My script seems no faster with the patch, even with categories=True in the call to APISite.preloadpages. I commented on the part of api.py where separate requests are still being sent for each page's categories.

The patch does not preload the category content itself, only the titles of the categories each page belongs to. A second request for this information is about 30 times faster. You could create a generator of all categories and preload them in a separate step, or combine that with the preloading iterator.
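A minimal sketch of the "separate step" idea, under the assumption that page.categories() is already cheap for the pages passed in and that site behaves like a Pywikibot APISite. Both function names are hypothetical.

```python
def unique_categories(pages):
    """Yield each distinct category of the given pages exactly once."""
    seen = set()
    for page in pages:
        for category in page.categories():
            title = category.title()
            if title not in seen:
                seen.add(title)
                yield category


def preload_categories(site, pages):
    """Preload the category pages themselves in batches (second step)."""
    return site.preloadpages(unique_categories(pages))
```

Deduplicating first matters because many pages share the same categories; without it the second step would fetch the same category page repeatedly.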

Okay, so an additional request must be sent to actually construct the Category objects.

I guess for my script I only need the titles, not the Category objects. It's not quite what the task title asks for, but is there a way to access the preloaded category titles when using APISite.preloadpages and not send the additional request per page to get the category information that I'm not using? This doesn't seem possible in the patch because update_page calls BasePage.categories if categories have been preloaded.

The general idea is a function get_category_titles, used in the following code, that sends no requests beyond those already sent by APISite.preloadpages.

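# site is assumed to be an existing APISite object;
# get_category_titles is the requested helper and does not exist yet.
def filter_pages_by_categories(pages, category_titles_to_find):
    for page in site.preloadpages(pages, categories=True):
        if any(category_title in category_titles_to_find
               for category_title in get_category_titles(page)):
            yield page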

Aklapper removed Xqt as the assignee of this task. Mar 13 2022, 7:52 PM
Aklapper added a subscriber: Xqt.

Removing task assignee due to inactivity, as this open task has been assigned for more than two years. See the email sent to the task assignee on February 06th 2022 (and T295729).

Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.

If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".

Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.

Xqt assigned this task to Mpaa.

Change 561697 merged by jenkins-bot:

[pywikibot/core@master] [FEAT] Add ability to preload categories

https://gerrit.wikimedia.org/r/561697