Data retrieval may be very long and heavy because of parsing Link during SiteLink initialization
Closed, Resolved, Public

Description

Run this code:

>>> import pywikibot
>>> repo = pywikibot.Site('wikidata', 'wikidata')
>>> item = pywikibot.ItemPage(repo, 'Q16503')
>>> data = item.get()

The last line takes many seconds, and the respective API call takes a while too. The reason is that during this operation all sitelinks are initialized and some of them are parsed in SiteLink._parse_namespace, which a) creates a new site object via APISite.fromDBName (not a cached one, as pywikibot.Site would return) and b) makes an API call for each site to get the namespace information (this can be very slow for many sites). Note that the combination of both caused my bot to crash with a MemoryError, with the traceback pointing to these methods.

This is all quite unexpected for bot operators who don't care about sitelinks (or who do, but not about which namespace they link to). Some lazy initialization should be introduced, probably in all of fromDBName, SiteLink and ItemPage.
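As a rough illustration of the lazy initialization suggested above, the sketch below keeps the raw sitelink data and builds a link object only when a specific entry is requested. It does not use Pywikibot; `ParsedLink` and `LazySiteLinks` are hypothetical names invented for this example, and the counter merely stands in for the expensive per-site parsing step.

```python
from collections.abc import Mapping


class ParsedLink:
    """Hypothetical stand-in for an expensively parsed sitelink."""

    parse_count = 0  # how many links have actually been parsed so far

    def __init__(self, dbname, title):
        # Assumption: in a real library this is where namespace data for
        # the target site would be needed. Here we only count the call.
        ParsedLink.parse_count += 1
        self.dbname = dbname
        self.title = title


class LazySiteLinks(Mapping):
    """Keep raw sitelink data and build link objects only on access."""

    def __init__(self, raw):
        self._raw = dict(raw)  # dbname -> title, stored as received
        self._parsed = {}      # dbname -> ParsedLink, filled on demand

    def __getitem__(self, dbname):
        if dbname not in self._parsed:
            # Raises KeyError for unknown dbnames, like a normal mapping.
            self._parsed[dbname] = ParsedLink(dbname, self._raw[dbname])
        return self._parsed[dbname]

    def __contains__(self, dbname):
        # Membership tests must not trigger parsing.
        return dbname in self._raw

    def __iter__(self):
        return iter(self._raw)

    def __len__(self):
        return len(self._raw)


# Placeholder data, not the real sitelinks of any item.
links = LazySiteLinks({'enwiki': 'Example', 'dewiki': 'Beispiel'})
```

With this shape, constructing the collection, taking its length or testing membership costs nothing per site; only `links['enwiki']` triggers the parsing step, and only for that one entry. Note that the inherited `values()` and `items()` views would still parse every entry when iterated.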

Event Timeline

Xqt triaged this task as High priority. Jun 20 2019, 11:59 AM

Change 518025 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Create APISite object through pywikibot.Site wrapper

https://gerrit.wikimedia.org/r/518025

Change 518025 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Create APISite object through pywikibot.Site wrapper

https://gerrit.wikimedia.org/r/518025

Xqt removed Xqt as the assignee of this task. Jul 29 2019, 10:33 AM
Xqt subscribed.

Bump as it affects the usability of newitem.py.

Simple reproduction: pywikibot.ItemPage(pywikibot.Site("wikidata","wikidata"),"Q4847311").get()

Xqt renamed this task from Data retrieval may be very long and heavy because of SiteLink initialization to Data retrieval may be very long and heavy because of parsing Link during SiteLink initialization. Jan 19 2021, 5:54 AM

Change 656997 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [maintenance] Add a preload_sites.py script to preload site informations

https://gerrit.wikimedia.org/r/656997

This still leaves one problem unhandled: a user is usually not registered on all wikis, and querying data without a user account results in an error. Therefore, if you try to run this script, it will almost always raise a pywikibot.exceptions.NoUsername error. In addition, the first load of site information is still slow. Two things need to be done:

  1. Allow page objects to be lazily loaded (not to be confused with preloading), so that no site object is created at all. Usually, when a user queries an item, only a few sitelinks or none at all are actually used.
  2. Introduce an "anonymous mode" in Pywikibot, i.e. allow scripts to fetch site or page information (read-only) without a registered account.

This is just an intermediate "solution" that can decrease the processing time of other scripts because it preloads and caches siteinfo settings. In compat we had large family files for these settings. Creating the site object is not the problem; it does not cause any API load. The underlying problem is link parsing, which needs an API call to retrieve the namespace aliases for each affected site.
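To illustrate why caching the namespace data helps, here is a minimal sketch in which the alias table for each site is fetched at most once and then reused for every link parsed against that site. It is not Pywikibot code: `fetch_namespace_aliases`, `namespace_aliases` and `parse_namespace` are hypothetical names, the alias table is made up, and the cache lives only in memory, whereas a preload script would presumably persist its results between runs.

```python
from functools import lru_cache

api_calls = []  # records every simulated request, for illustration only


def fetch_namespace_aliases(dbname):
    """Hypothetical stand-in for the slow per-site API request."""
    api_calls.append(dbname)
    # Made-up alias table: lower-cased prefix -> namespace number.
    return {'talk': 1, 'user': 2}


@lru_cache(maxsize=None)
def namespace_aliases(dbname):
    """Return the alias table for a site, fetching it at most once."""
    return fetch_namespace_aliases(dbname)


def parse_namespace(dbname, title):
    """Return the namespace number of a title (0 if no known prefix)."""
    prefix, sep, _ = title.partition(':')
    if not sep:
        # No colon, so no lookup and no request is needed at all.
        return 0
    return namespace_aliases(dbname).get(prefix.strip().lower(), 0)
```

Parsing many links that point to the same site then costs one simulated request instead of one per link; the first link for each new site still pays the full price, which matches the remark above that the first load remains slow.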

Change 656997 merged by jenkins-bot:
[pywikibot/core@master] [maintenance] Add a preload_sites.py script to preload site informations

https://gerrit.wikimedia.org/r/656997

Change 660758 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [IMPR] speed up preload_sites.py maintenance script

https://gerrit.wikimedia.org/r/660758

Change 660809 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [IMPR] Create a SiteLink with getitem method

https://gerrit.wikimedia.org/r/660809

Change 660758 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] speed up preload_sites.py maintenance script

https://gerrit.wikimedia.org/r/660758

Xqt claimed this task.

Change 660809 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] Create a SiteLink with getitem method

https://gerrit.wikimedia.org/r/660809