
newitem.py has a very long start
Closed, Resolved · Public

Description

Probably since T256676 was solved: when running newitem.py, the bot first spends a very long time retrieving the list of skipping templates. It looks like the bot is retrieving the skipping templates for all wikis, because I found this in the log:

 Retrieving skipping templates for site wikiquote:cs...
2021-01-18 17:53:59       _basesite.py,   77 in           __init__: VERBOSE  Site wikipedia:aa instantiated and marked "obsolete" to prevent access
2021-01-18 17:53:59       _basesite.py,   77 in           __init__: VERBOSE  Site wikibooks:aa instantiated and marked "obsolete" to prevent access
2021-01-18 17:53:59       _basesite.py,   77 in           __init__: VERBOSE  Site wiktionary:aa instantiated and marked "obsolete" to prevent access
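
The log suggests that every sitelink is being resolved to a Site object, even for obsolete wikis. A minimal sketch of where the time goes, assuming the skipping templates are collected via the Wikidata item of the local delete template (as discussed below); the page title is illustrative:

import pywikibot

site = pywikibot.Site('cs', 'wikipedia')
page = pywikibot.Page(site, 'Template:Delete')
item = pywikibot.ItemPage.fromPage(page)
# With the current implementation, get() parses ALL sitelinks of the item,
# which instantiates a Site object (and may fetch siteinfo) for every
# linked wiki - including obsolete ones, as the log above shows.
item.get()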

Event Timeline

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.
Xqt triaged this task as Medium priority. Jan 31 2021, 4:01 PM
Xqt added a project: Performance Issue.
This comment was removed by Xqt.

This is the same problem as noted in T226157. To parse sitelinks at Wikidata, siteinfo content is required. It is cached inside the apicache-py3 folder and usually expires after 30 days.
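
This lifetime is what the -API_config_expiry global option mentioned below controls; the underlying config value can be checked from Python (the 30-day default is inferred from the expiry behaviour just described):

>>> from pywikibot import config
>>> config.API_config_expiry
30

See the following examples: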

(A) Loading siteinfo via the API with a clean apicache-py3 folder

>>> import pywikibot
>>> from scripts.newitem import NewItemRobot
>>> bot = NewItemRobot([])
>>> def f(bot):
	from datetime import datetime
	site = pywikibot.Site('cs')
	start = datetime.now()
	temp = bot.get_skipping_templates(site)
	print('Time used:', datetime.now() - start)

	
>>> f(bot)
Retrieving skipping templates for site wikipedia:cs...
WARNING: C:\pwb\GIT\core\pywikibot\tools\__init__.py:1479: UserWarning: Site wikipedia:be-tarask instantiated using different code "be-x-old"
  return obj(*__args, **__kw)

Time used: 0:06:01.608467

This means loading the siteinfo content takes 6 minutes to complete.

(B) Try a second call of this function:

>>> f(bot)
Time used: 0:00:00

As expected, all templates are now held by the bot instance.
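
This per-instance caching is plain memoization; a minimal sketch of the pattern (attribute and method names follow the script, the fetch helper is hypothetical):

class NewItemRobot:
    def __init__(self):
        self._skipping_templates = {}  # site -> templates to skip

    def get_skipping_templates(self, site):
        if site not in self._skipping_templates:
            # slow path: resolve the templates via their Wikidata sitelinks
            self._skipping_templates[site] = self._fetch_templates(site)
        return self._skipping_templates[site]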

(C) Delete the instance cache and try again

>>> bot._skipping_templates = {}
>>> f(bot)
Retrieving skipping templates for site wikipedia:cs...
Time used: 0:00:03.309496

The content is fetched from the apicache-py3 folder in only 3 seconds.
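
The cached responses are plain files inside the apicache-py3 directory, so you can check what a run has stored there (assuming the default cache location below config.base_dir):

>>> import os
>>> from pywikibot import config
>>> cache_dir = os.path.join(config.base_dir, 'apicache-py3')
>>> len(os.listdir(cache_dir))  # roughly one entry per cached API request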

(D) Use preload_sites for an empty apicache-py3

C:\pwb\GIT\core>pwb preload_sites
Preloading sites of wikibooks family...
Preloading sites of wikinews family...
Preloading sites of wikipedia family...
Preloading sites of wikiquote family...
Preloading sites of wikisource family...
Preloading sites of wikiversity family...
Preloading sites of wikivoyage family...
Preloading sites of wiktionary family...
Preloading sites of wikiversity family completed.
Preloading sites of wikivoyage family completed.
Preloading sites of wikinews family completed.
Preloading sites of wikisource family completed.
Preloading sites of wikiquote family completed.
Preloading sites of wikibooks family completed.
Preloading sites of wiktionary family completed.
Preloading sites of wikipedia family completed.
Loading time used: 0:02:13.395826

Preloading needs only 2.2 minutes vs. 6 minutes via the script. Now check the script's loading time:

>>> bot._skipping_templates = {}
>>> f(bot)
Retrieving skipping templates for site wikipedia:cs...
Time used: 0:00:04.476267

Again it only takes a few seconds because the content is already in apicache-py3.

Conclusion

I propose using the preload_sites.py maintenance script to preload the siteinfo content until we have a better solution for parsing sitelinks. You may run it as a batch job, e.g. monthly (because the expiry time is 30 days) or earlier. To force preloading you can use the global option -API_config_expiry.

preload_sites.py is a maintenance script added to the current master 6.0.0.dev0. It uses threads to load several siteinfo contents simultaneously. The number of parallel workers can be given with the -worker option, but this is normally not necessary; the default setting depends on the number of processors of the machine.

Usage:

python pwb.py [-API_config_expiry:{<num>}] preload_sites [{<family>}]* [-worker:{<num>}]
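
For example, to refresh only the wikipedia family with 10 parallel workers, treating the cache as expired so everything is really refetched (the values are illustrative, and the assumption is that an expiry of 0 forces a reload):

python pwb.py -API_config_expiry:0 preload_sites wikipedia -worker:10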

Note that most WMF sites are never relevant to most bot operators. When retrieving skipping templates, only one sitelink is useful.

The current implementation needs siteinfo for each listed sitelink to parse the namespaces. Anyway, I agree with you that we should find a better solution in this matter. In the meantime preload_sites will help; I improved that script in https://gerrit.wikimedia.org/r/c/pywikibot/core/+/660758 and needed 20 seconds to preload all siteinfo for 748 sites with 25 workers.
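
The idea behind the threaded preloading is straightforward; a minimal sketch, assuming that accessing site.siteinfo issues the (cached) API request. The code list and worker count are illustrative, not the script's actual implementation:

from concurrent.futures import ThreadPoolExecutor

import pywikibot

def preload(code):
    site = pywikibot.Site(code, 'wikipedia')
    site.siteinfo.get('general')  # triggers the request, fills apicache-py3
    return site

codes = ['cs', 'de', 'en', 'fr']  # illustrative subset of the 748 sites
with ThreadPoolExecutor(max_workers=25) as executor:
    list(executor.map(preload, codes))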

Change 660809 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [IMPR] Create a SiteLink with getitem method

https://gerrit.wikimedia.org/r/660809

When fixing this, make sure you test it with a new account (specifically, one not attached to all possible projects).

Could you explain? What is the reason for a new account?

Most bots do not have an account on all wikis, and when fetching the skipping templates, all sitelinks of Template:Delete are fetched, which include sitelinks to various sites where the bot has no account. This will make the script unusable. Note that even if preload_sites.py loads data from all possible sites, it does not solve the problem unless Pywikibot supports an "anonymous mode".

Creating a Site object does not require any API request. Calling the get() method of an ItemPage instance loads the content from Wikibase. With the current implementation all sitelinks are parsed, and therefore siteinfo is needed for every site. With the patch above, the sitelinks are no longer parsed when the Wikibase content is read; a sitelink is only parsed when it is used, e.g. to create the corresponding Page.
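
A minimal sketch of the lazy approach taken by the patch (class and helper names are simplified and hypothetical, not the actual pywikibot code):

class SiteLinks:
    """Hold raw sitelink data; parse an entry only on first access."""

    def __init__(self, raw):
        self._raw = raw       # dbname -> raw sitelink dict, still unparsed
        self._parsed = {}

    def __getitem__(self, dbname):
        if dbname not in self._parsed:
            # Parsing happens here, so siteinfo is only needed for sites
            # whose sitelinks are actually used.
            self._parsed[dbname] = parse_sitelink(self._raw[dbname])
        return self._parsed[dbname]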

I tested the behaviour with a brand new account and it worked fine.

Change 660809 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] Create a SiteLink with getitem method

https://gerrit.wikimedia.org/r/660809

Xqt claimed this task.