Only load item in harvest_template.py when needed
Closed, Resolved, Public

Description

At the moment HarvestRobot.treat() in harvest_template.py always loads the item (item.get()). This is quite inefficient if there is no data to import. The bot should first check whether there is (valid) data to import and load the item only if there is. If there is nothing to import, the bot shouldn't load the item. This should dramatically increase the speed of the bot.
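A minimal sketch of the proposed ordering, not the actual harvest_template.py code. FakeItem, treat() as written here, and the field_to_property mapping are hypothetical stand-ins; the only point is that the cheap "is there anything to import?" check runs before the expensive item load.

```python
class FakeItem:
    """Hypothetical stand-in for a Wikidata item; counts expensive loads."""

    def __init__(self, claims):
        self._claims = claims
        self.load_count = 0

    def get(self):
        # Stands in for the API round trip that item.get() performs.
        self.load_count += 1
        return {"claims": self._claims}


def treat(item, template_params, field_to_property):
    """Return the (property, value) pairs that would be imported."""
    # Cheap step first: keep non-empty template values that map to a property.
    candidates = [
        (field_to_property[field], value.strip())
        for field, value in template_params.items()
        if field in field_to_property and value.strip()
    ]
    if not candidates:
        return []  # nothing to import, so the item is never loaded

    # Expensive step only when there is something to import.
    existing = item.get()["claims"]
    return [(prop, value) for prop, value in candidates if prop not in existing]
```

With this ordering, a page whose template has no usable values costs no item request at all.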

Event Timeline

Multichill assigned this task to Pywikibugs.
Multichill raised the priority of this task from to Medium.
Multichill updated the task description.
Multichill changed Security from none to None.
Multichill added a project: Performance Issue.
Multichill subscribed.

item.get is needed before item.claims can be accessed (on the next line).

We could

  1. replace item.claims with a different API call that only gets the list of properties used on the item
  2. extend option 1 to be a generic approach to lazy load item data
  3. move "has claims for all properties" check further down in the process.
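Option 2 could look roughly like the sketch below. LazyItem and its fetch callable are hypothetical and are not part of pywikibot; the sketch only illustrates deferring the request until some attribute is actually read, and caching the result so that later reads are free.

```python
class LazyItem:
    """Hypothetical item wrapper that loads its data on first attribute access."""

    def __init__(self, fetch):
        self._fetch = fetch      # assumed callable that performs the API request
        self._data = None
        self.fetch_count = 0     # kept only to make the behaviour observable

    def _load(self):
        # Fetch once, then reuse the cached result.
        if self._data is None:
            self._data = self._fetch()
            self.fetch_count += 1
        return self._data

    @property
    def claims(self):
        return self._load()["claims"]

    @property
    def labels(self):
        return self._load()["labels"]
```

Code that never touches claims or labels never triggers a request, so the caller no longer has to remember to call get() first.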

The problem with option 3 is that immediately after this check, harvest_template does a page.get(), and page.get() is probably more expensive than item.get(), at least on English Wikipedia, where article text size exceeds typical Wikidata item JSON size. This may not be true for smaller wikis where the average article text is smaller, but I would expect it to be true for most of the top 10 Wikipedias.

"page.get() is probably more expensive than item.get()"

The bot now uses a preloading generator, so page.get() is not expensive at all and is not even necessary.
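For readers unfamiliar with the idea, here is a rough sketch of why preloading makes the per-page cost small. It is not pywikibot's implementation: preloading(), fetch_batch, and group_size are hypothetical names, and the sketch assumes the page text for a whole batch can be fetched in one request.

```python
from itertools import islice


def preloading(titles, fetch_batch, group_size=50):
    """Yield (title, text) pairs, fetching text for a whole batch per request."""
    iterator = iter(titles)
    while True:
        batch = list(islice(iterator, group_size))
        if not batch:
            return
        # One request covers the whole batch (fetch_batch is an assumed helper
        # that returns a dict mapping each title to its text).
        texts = fetch_batch(batch)
        for title in batch:
            yield title, texts[title]
```

By the time the bot handles a page, its text is already in memory, so reading it costs no extra request.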

Change 370664 had a related patch set uploaded (by Matěj Suchánek; owner: Matěj Suchánek):
[pywikibot/core@master] [IMPR] Share claimit.py logic with harvest_template.py

https://gerrit.wikimedia.org/r/370664

Change 370664 merged by jenkins-bot:
[pywikibot/core@master] [IMPR] Share claimit.py logic with harvest_template.py

https://gerrit.wikimedia.org/r/370664