Data retrieval may be very long and heavy because of parsing Link during SiteLink initialization
Closed, ResolvedPublic

Authored By

	matej_suchanek
	Jun 20 2019, 8:10 AM

Description

Run this code:

>>> import pywikibot
>>> repo = pywikibot.Site('wikidata', 'wikidata')
>>> item = pywikibot.ItemPage(repo, 'Q16503')
>>> data = item.get()

The last line takes many seconds, much longer than the respective API call alone. The reason is that during this operation all sitelinks are initialized and some of them are parsed in SiteLink._parse_namespace, which a) creates a new site object via APISite.fromDBName (not a cached one, as pywikibot.Site would return) and b) makes an API call for each site to get the namespace information (this can be very slow for many sites). Note that the combination of both caused my bot to crash with a MemoryError, with the traceback pointing to these methods.
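To illustrate point a), here is a minimal sketch of the difference between an uncached and a cached factory. It does not use Pywikibot at all: FakeSite, site_uncached and site_cached are hypothetical stand-ins, and the caching is done with functools.lru_cache purely for illustration, not because Pywikibot is implemented that way.

```python
import functools


class FakeSite:
    """Stand-in for a site object (hypothetical, not pywikibot's APISite)."""

    instances = 0  # counts how many objects were constructed

    def __init__(self, dbname):
        self.dbname = dbname
        FakeSite.instances += 1


def site_uncached(dbname):
    # Builds a fresh object on every call, which is the behaviour the
    # description attributes to APISite.fromDBName.
    return FakeSite(dbname)


@functools.lru_cache(maxsize=None)
def site_cached(dbname):
    # Returns the same object for the same key, which is the behaviour
    # the description attributes to pywikibot.Site.
    return FakeSite(dbname)


dbnames = ['enwiki', 'dewiki', 'enwiki', 'enwiki', 'dewiki']

FakeSite.instances = 0
uncached = [site_uncached(name) for name in dbnames]
uncached_count = FakeSite.instances

FakeSite.instances = 0
cached = [site_cached(name) for name in dbnames]
cached_count = FakeSite.instances

print(uncached_count, cached_count)
```

With many sitelinks per item and many items per run, the uncached variant keeps allocating objects that are never reused, which is consistent with the memory growth reported above.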

This is all quite unexpected for bot operators who don't care about sitelinks (or who do, but not about which namespace they link to). Some lazy initialization should be introduced, probably in all of fromDBName, SiteLink and ItemPage.
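One possible shape of such lazy initialization is sketched below. LazySiteLink and fake_namespace_lookup are hypothetical names, and the lookup function only simulates the per-site API request; this is not how Pywikibot's SiteLink is written, just an illustration of deferring the expensive step until the namespace is actually requested.

```python
from functools import cached_property


class LazySiteLink:
    """Hypothetical sitelink that resolves its namespace only on demand."""

    def __init__(self, title, dbname, namespace_lookup):
        self.title = title
        self.dbname = dbname
        # Callable mapping a dbname to the set of namespace prefixes
        # known on that site; assumed to be expensive.
        self._lookup = namespace_lookup

    @cached_property
    def namespace(self):
        prefix, sep, _ = self.title.partition(':')
        if not sep:
            # No colon in the title, so no lookup is needed at all.
            return ''
        known = self._lookup(self.dbname)
        return prefix if prefix in known else ''


calls = []


def fake_namespace_lookup(dbname):
    # Simulates the per-site API request and records that it happened.
    calls.append(dbname)
    return {'Category', 'Template'}


links = [
    LazySiteLink('Category:Rivers', 'enwiki', fake_namespace_lookup),
    LazySiteLink('Prague', 'cswiki', fake_namespace_lookup),
]
calls_after_init = len(calls)         # nothing has been looked up yet

first_namespace = links[0].namespace  # triggers exactly one lookup
links[0].namespace                    # served from the cached value
calls_after_access = len(calls)
```

Operators who never touch the namespace attribute would pay nothing for it under this scheme.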

Details

Subject	Repo	Branch	Lines +/-
[IMPR] Create a SiteLink with __getitem__ method	pywikibot/core	master	+20 -12
[IMPR] speed up preload_sites.py maintenance script	pywikibot/core	master	+7 -5
[maintenance] Add a preload_sites.py script to preload site informations	pywikibot/core	master	+90 -1
[bugfix] Create APISite object through pywikibot.Site wrapper	pywikibot/core	master	+1 -1

Related Objects

Status	Assigned	Task
Resolved	Amire80	T70071 [Compact links] Prioritise interwikis to featured pages
Declined	None	T70067 Tell in sitelink whether target is the "preferred" language for the topic/place of the article
Resolved	Addshore	T42810 Wikibase badges (tracking)
Resolved	Lokal_Profil	T128202 Implement badges support in pywikibot
Invalid	None	T72936 Important tasks to be solved (tracking)
Resolved	Lokal_Profil	T66457 refactor sitelinks structure to support badges
Resolved	Xqt	T273386 newitem.py has very long start
Resolved	Xqt	T238471 Performance problems with pywikibot's pagegenerator
Resolved	Xqt	T226157 Data retrieval may be very long and heavy because of parsing Link during SiteLink initialization
Resolved	matej_suchanek	T245809 Make Wikibase entities load and initialize lazily

Event Timeline

matej_suchanek created this task. Jun 20 2019, 8:10 AM

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Jun 20 2019, 8:10 AM

matej_suchanek added a parent task: T66457: refactor sitelinks structure to support badges. Jun 20 2019, 8:11 AM

Xqt triaged this task as High priority. Jun 20 2019, 11:59 AM

Xqt claimed this task. Jun 20 2019, 12:09 PM

Change 518025 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Create APISite object through pywikibot.Site wrapper

https://gerrit.wikimedia.org/r/518025

gerritbot added a project: Patch-For-Review. Jun 20 2019, 12:49 PM

Change 518025 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Create APISite object through pywikibot.Site wrapper

https://gerrit.wikimedia.org/r/518025

Xqt mentioned this in rPWBC9316e8117aac: [bugfix] Create APISite object through pywikibot.Site wrapper. Jul 29 2019, 2:47 AM

Maintenance_bot removed a project: Patch-For-Review. Jul 29 2019, 3:10 AM

Xqt removed Xqt as the assignee of this task. Jul 29 2019, 10:33 AM

Xqt subscribed.

matej_suchanek mentioned this in T238471: Performance problems with pywikibot's pagegenerator. Nov 17 2019, 12:41 PM

matej_suchanek mentioned this in T245809: Make Wikibase entities load and initialize lazily. Feb 21 2020, 9:29 AM

matej_suchanek added a subtask: T245809: Make Wikibase entities load and initialize lazily.

matej_suchanek moved this task from Backlog to Data loading problems on the Pywikibot-Wikidata board. Feb 21 2020, 9:34 AM

matej_suchanek mentioned this in T249692: Pywikibot crashes when _WbDataPage is initialized with non-existing target. Apr 8 2020, 7:35 AM

matej_suchanek mentioned this in T252306: PAWS gives API-errors on some cases. May 22 2020, 12:55 PM

Chicocvenancio subscribed. Sep 25 2020, 12:36 PM

Bump, as this affects the usability of newitem.py.

Simple reproduction: pywikibot.ItemPage(pywikibot.Site("wikidata","wikidata"),"Q4847311").get()

Bugreporter mentioned this in T238405: Pywikibot should have a robust way to handle unknown sites. Jan 17 2021, 7:09 AM

Xqt closed subtask T245809: Make Wikibase entities load and initialize lazily as Resolved. Jan 18 2021, 6:43 AM

Xqt renamed this task from Data retrieval may be very long and heavy because of SiteLink initialization to Data retrieval may be very long and heavy because of parsing Link during SiteLink initialization. Jan 19 2021, 5:54 AM

Change 656997 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [maintenance] Add a preload_sites.py script to preload site informations

https://gerrit.wikimedia.org/r/656997

gerritbot added a project: Patch-For-Review. Jan 19 2021, 6:46 AM

In T226157#6756845, @gerritbot wrote:

Change 656997 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [maintenance] Add a preload_sites.py script to preload site informations

https://gerrit.wikimedia.org/r/656997

This still leaves a problem unsolved: a user is usually not registered on all wikis, and querying data without a user account results in an error. Therefore, if you try to run this script, it will almost always raise a pywikibot.exceptions.NoUsername error. In addition, the first load of site information is still slow. Two things need to be done:

Allow page objects to be lazily loaded (not to be confused with preloading), so that no site object is created at all. Usually, when a user queries an item, only a few sitelinks or none are actually used.
Introduce an "anonymous mode" in Pywikibot, i.e. allow scripts to fetch site or page information (read-only) without a registered account.
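The first suggestion could look roughly like the following sketch: a mapping that stores the raw sitelink data and only builds a link object when a specific key is requested. LazySiteLinks and ParsedLink are hypothetical names, and the sample titles are made up; no Pywikibot class is used or implied.

```python
from collections.abc import Mapping


class ParsedLink:
    """Stand-in for a link object that is expensive to build (hypothetical)."""

    built = 0  # counts constructed objects

    def __init__(self, dbname, title):
        ParsedLink.built += 1
        self.dbname = dbname
        self.title = title


class LazySiteLinks(Mapping):
    """Keeps raw data and creates link objects only in __getitem__."""

    def __init__(self, raw):
        self._raw = dict(raw)  # dbname -> title, plain strings
        self._links = {}       # dbname -> ParsedLink, filled on demand

    def __getitem__(self, dbname):
        if dbname not in self._links:
            # Raises KeyError for unknown dbnames, like a normal mapping.
            self._links[dbname] = ParsedLink(dbname, self._raw[dbname])
        return self._links[dbname]

    def __contains__(self, dbname):
        # Overridden so that a membership test does not build an object.
        return dbname in self._raw

    def __iter__(self):
        return iter(self._raw)

    def __len__(self):
        return len(self._raw)


raw = {'enwiki': 'Danube', 'dewiki': 'Donau', 'cswiki': 'Dunaj'}
sitelinks = LazySiteLinks(raw)
built_before = ParsedLink.built   # no link object exists yet

title = sitelinks['dewiki'].title
built_after = ParsedLink.built    # only the requested link was built
```

Iterating over values() or items() would still build every link, so a caller that needs only titles would have to read the raw data instead.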

In T226157#6762542, @Bugreporter wrote:

This still leaves a problem unsolved: a user is usually not registered on all wikis, and querying data without a user account results in an error. Therefore, if you try to run this script, it will almost always raise a pywikibot.exceptions.NoUsername error. In addition, the first load of site information is still slow. Two things need to be done:

Allow page objects to be lazily loaded (not to be confused with preloading), so that no site object is created at all. Usually, when a user queries an item, only a few sitelinks or none are actually used.

Introduce an "anonymous mode" in Pywikibot, i.e. allow scripts to fetch site or page information (read-only) without a registered account.

This is just an intermediate "solution" that can decrease the processing time of other scripts because it preloads and caches siteinfo settings. In compat we had large family files for these settings. Creating the site object is not the problem; it does not cause any API load. The underlying problem is link parsing, which needs an API call to retrieve the namespace aliases for each affected site.
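The preload-and-cache idea can be sketched in isolation as follows. This is not the preload_sites.py script: fetch_namespace_aliases, cached_aliases and preload are hypothetical helpers, the "API request" is simulated, and the JSON-file-per-site layout is an assumption made only for the example.

```python
import json
import tempfile
from pathlib import Path

api_calls = []


def fetch_namespace_aliases(dbname):
    """Simulated API request; a real one would go over the network."""
    api_calls.append(dbname)
    return {'site': dbname, 'aliases': ['Image', 'WP']}


def cached_aliases(dbname, cache_dir):
    """Return the aliases for a site, hitting the 'API' only on a miss."""
    path = Path(cache_dir) / f'{dbname}.json'
    if path.exists():
        return json.loads(path.read_text(encoding='utf-8'))
    data = fetch_namespace_aliases(dbname)
    path.write_text(json.dumps(data), encoding='utf-8')
    return data


def preload(dbnames, cache_dir):
    """Warm the cache once so later lookups need no request."""
    for dbname in dbnames:
        cached_aliases(dbname, cache_dir)


with tempfile.TemporaryDirectory() as cache_dir:
    preload(['enwiki', 'dewiki'], cache_dir)
    calls_after_preload = len(api_calls)

    # A later link-parsing step reads from the cache instead.
    result = cached_aliases('enwiki', cache_dir)
    calls_after_lookup = len(api_calls)
```

The cost is moved to a one-time warm-up rather than removed, which matches the description of this as an intermediate measure.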

Change 656997 merged by jenkins-bot:
[pywikibot/core@master] [maintenance] Add a preload_sites.py script to preload site informations

https://gerrit.wikimedia.org/r/656997

Xqt mentioned this in rPWBCf3de619637fe: [maintenance] Add a preload_sites.py script to preload site informations. Jan 26 2021, 9:07 AM

Maintenance_bot removed a project: Patch-For-Review. Jan 26 2021, 9:10 AM

Xqt mentioned this in T273386: newitem.py has very long start. Jan 31 2021, 5:02 PM

Xqt added a parent task: T273386: newitem.py has very long start. Jan 31 2021, 6:16 PM

Xqt added a parent task: T238471: Performance problems with pywikibot's pagegenerator. Jan 31 2021, 6:20 PM

Change 660758 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [IMPR] speed up preload_sites.py maintenance script

https://gerrit.wikimedia.org/r/660758

gerritbot added a project: Patch-For-Review. Feb 1 2021, 8:14 AM

Xqt reopened subtask T245809: Make Wikibase entities load and initialize lazily as Open. Feb 1 2021, 10:27 AM

Change 660809 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [IMPR] Create a SiteLink with __getitem__ method

https://gerrit.wikimedia.org/r/660809

Change 660758 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] speed up preload_sites.py maintenance script

https://gerrit.wikimedia.org/r/660758

Xqt mentioned this in rPWBC91d7c4182de7: [IMPR] speed up preload_sites.py maintenance script. Feb 2 2021, 8:47 AM

Xqt closed this task as Resolved. Feb 3 2021, 9:46 AM

Xqt claimed this task.

Xqt closed subtask T245809: Make Wikibase entities load and initialize lazily as Resolved.

Change 660809 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] Create a SiteLink with __getitem__ method

https://gerrit.wikimedia.org/r/660809

Xqt mentioned this in rPWBC51b119a1bbf3: [IMPR] Create a SiteLink with __getitem__ method. Feb 3 2021, 10:00 AM

matej_suchanek awarded a token. Feb 3 2021, 10:04 AM

Maintenance_bot removed a project: Patch-For-Review. Feb 3 2021, 10:10 AM
