Only load item in harvest_template.py when needed
Closed, Resolved (Public)

Authored By

	Multichill
	Dec 1 2014, 9:46 PM

Description

At the moment, HarvestRobot.treat() in harvest_template.py always loads the item (item.get()). This is quite inefficient if there is no data to import. The bot should first check whether there is (valid) data to import and load the item only if there is. Otherwise, it shouldn't load the item at all. This should dramatically increase the speed of the bot.
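A minimal sketch of the requested ordering, not the actual harvest_template.py code: inspect the template parameters first and call item.get() only when something importable is found. FakeItem, treat's signature, and the fields mapping are hypothetical stand-ins so the example runs without Pywikibot or network access.

```python
class FakeItem:
    """Hypothetical stand-in for a Wikidata item; counts how often it is loaded."""

    def __init__(self, claims):
        self._claims = claims
        self.loads = 0

    def get(self):
        # In the real bot this would be a network round trip.
        self.loads += 1
        return {"claims": self._claims}


def treat(template_params, fields, item):
    """Return the (property, value) pairs that would be imported.

    `fields` maps template parameter names to property IDs (assumed shape).
    """
    # Step 1: look for importable data without touching the item.
    candidates = {
        fields[name]: value.strip()
        for name, value in template_params.items()
        if name in fields and value.strip()
    }
    if not candidates:
        return []  # nothing to import, so the item is never loaded

    # Step 2: only now pay for loading the item.
    existing = item.get()["claims"]
    return [(prop, value) for prop, value in candidates.items()
            if prop not in existing]
```

With an empty or unmapped template parameter, treat() returns early and the load counter on the item stays at zero.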

Details

	Subject	Repo	Branch	Lines +/-
	[IMPR] Share claimit.py logic with harvest_template.py	pywikibot/core	master	+151 -134

Event Timeline

Multichill created this task. Dec 1 2014, 9:46 PM

Multichill assigned this task to Pywikibugs.

Multichill raised the priority of this task to Medium.

Multichill updated the task description. (Show Details)

Multichill added projects: Pywikibot, Pywikibot-Wikidata.

Multichill changed Security from none to None.

Multichill added a project: Performance Issue.

Multichill subscribed.

item.get is needed before item.claims can be accessed (on the next line).

We could:

1. replace item.claims with a different API call that only gets the list of properties used on the item
2. extend option 1 to be a generic approach to lazy load item data
3. move the "has claims for all properties" check further down in the process.

The problem with option 3 is that immediately after this check, harvest_template does a page.get(), and page.get() is probably more expensive than item.get(), at least on English Wikipedia, where article text size exceeds typical Wikidata item JSON size. This may not be true for smaller wikis where the average article text is smaller (but I would expect it to be true for most of the top 10 Wikipedias).
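The generic lazy-loading idea in option 2 could be sketched roughly as follows. This is an illustration only, not Pywikibot's implementation: LazyItem, its loader callable, and load_count are hypothetical names, and the loader stands in for whatever API call fetches the item data.

```python
class LazyItem:
    """Hypothetical wrapper that defers the expensive fetch until data is read."""

    def __init__(self, loader):
        self._loader = loader  # callable performing the expensive fetch (assumed)
        self._data = None
        self.load_count = 0    # for illustration: how many fetches happened

    def _ensure_loaded(self):
        # Fetch at most once; later accesses reuse the cached data.
        if self._data is None:
            self._data = self._loader()
            self.load_count += 1

    @property
    def claims(self):
        self._ensure_loaded()
        return self._data["claims"]


item = LazyItem(lambda: {"claims": {"P31": ["Q5"]}})
loads_before_access = item.load_count  # no fetch has happened yet
has_p31 = "P31" in item.claims         # first access triggers the fetch
has_p19 = "P19" in item.claims         # second access reuses the cache
```

Callers keep writing item.claims, and the explicit item.get() before the check becomes unnecessary because the first attribute access triggers the fetch.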

jayvdb moved this task from Backlog to Design discussions on the Pywikibot-Wikidata board. Dec 2 2014, 2:13 AM

Multichill added a project: Wikidata. Dec 8 2014, 7:58 PM

JanZerebecki moved this task from incoming to monitoring on the Wikidata board. Jul 23 2015, 9:11 PM

Restricted Application added a subscriber: Aklapper. Jul 23 2015, 9:11 PM

Xqt removed Pywikibugs as the assignee of this task. Jun 10 2017, 5:23 PM

In T76391#800221, @jayvdb wrote:

page.get() is probably more expensive than item.get()

The bot now uses a preloading generator, so page.get() is not expensive at all, and not even necessary.

Change 370664 had a related patch set uploaded (by Matěj Suchánek; owner: Matěj Suchánek):
[pywikibot/core@master] [IMPR] Share claimit.py logic with harvest_template.py

https://gerrit.wikimedia.org/r/370664

Restricted Application added a subscriber: PokestarFan. Aug 12 2017, 7:20 AM

gerritbot added a project: Patch-For-Review. Aug 12 2017, 7:20 AM

matej_suchanek claimed this task. Aug 12 2017, 7:46 AM

Change 370664 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] Share claimit.py logic with harvest_template.py

https://gerrit.wikimedia.org/r/370664

matej_suchanek closed this task as Resolved. Sep 17 2017, 3:22 PM

matej_suchanek removed a project: Patch-For-Review.