
Optimize WDCM (S)itelinks engine runs
Closed, Resolved, Public

Description

  • The WDCM (S)itelinks engine runs on stat1005 are quite heavy;
  • Some critical steps (WDQS for subclasses/superclasses, MediaWiki API for large collections of item labels) are error-prone and should be made more robust;
  • Eliminate all possible confusion in the data model design (item/class distinction must be made explicit in modeling).

Event Timeline

GoranSMilovanovic created this task.
  • This is probably solved by a combination of a BFS run from SERVICE gas:service and a simple SPARQL binding that avoids fetching empty classes.
  • Testing now.
  • From one problem into another; in the category of Scientific Articles:
rc <- rawToChar(res$content)
Error in rawToChar(res$content) :
  long vectors not supported yet: raw.c:68
rc <- rawToChar(res$content, multiple = TRUE)
Error: cannot allocate vector of size 19.3 Gb
  • OK, resolved by falling back to a "less greedy" mode (I had tried to fetch items, their classes, the classes of their classes, and depth from the origin...)
  • JSON parsing abandoned; a simple regex that extracts items only is in place, and it seems like we're back in the game.
  • Still testing. For reasons described in a recent e-mail in relation to T203165, I will wait for the cluster reboot before starting another main engine run.
  • Test successful. We have our WDCM categories back.
  • Continue implementations on T187396, let T203389 wait until the cluster reboot tomorrow.
  • With Map-Reduce, this update runs smoothly.
  • If I can fix T200609, we will most probably switch to Spark here.
  • Final adjustments to the project selection criteria are under way;
  • the engine will run from stat1007 crontab.
  • selection criteria adjusted;
  • ML phase now.
  • One engine update run takes approx. 8h;
  • putting it on a weekly update schedule now;
  • even if the ETL phase can be further optimized, the ML phase is still critical with {maptpx};
  • it is questionable whether Spark MLlib, MALLET, or anything else, can help in that respect.
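
For readers who hit the same `rawToChar()` failure reported earlier in this timeline, here is a minimal sketch of regex-based item extraction. It is not the WDCM engine code. It assumes the response body is available as a raw vector (as in `res$content` above) and that the only `Q<digits>` strings in the body are item identifiers, which holds when only item URIs are requested but not when labels are included. The function names `extract_item_ids` and `safe_length` and the `chunk_size` parameter are hypothetical. The body is converted to text one chunk at a time, so no single string has to hold the whole response, and a trailing partial identifier is carried over to the next chunk.

```r
# Hypothetical helper: number of leading bytes of `buf` that can be processed
# without cutting a trailing "Q<digits>" token in half.
safe_length <- function(buf) {
  i <- length(buf)
  # Walk back over trailing ASCII digits (bytes 48..57).
  while (i > 0 && as.integer(buf[i]) >= 48L && as.integer(buf[i]) <= 57L) {
    i <- i - 1
  }
  # If the digits (possibly none) are preceded by "Q" (byte 81), hold it back.
  if (i > 0 && as.integer(buf[i]) == 81L) i - 1 else length(buf)
}

# Hypothetical helper: extract unique item IDs from a raw response body,
# chunk by chunk, without JSON parsing.
extract_item_ids <- function(content, chunk_size = 1e6) {
  stopifnot(is.raw(content), chunk_size >= 1)
  n <- length(content)
  found <- list()
  carry <- raw(0)
  pos <- 1
  while (pos <= n) {
    end <- min(pos + chunk_size - 1, n)
    buf <- c(carry, content[pos:end])
    # On the last chunk there is nothing left to wait for: process everything.
    keep <- if (end < n) safe_length(buf) else length(buf)
    txt <- rawToChar(buf[seq_len(keep)])
    # useBytes = TRUE avoids errors when a chunk splits a multibyte character.
    m <- gregexpr("Q[0-9]+", txt, useBytes = TRUE)
    found[[length(found) + 1]] <- regmatches(txt, m)[[1]]
    carry <- buf[seq_len(length(buf) - keep) + keep]
    pos <- end + 1
  }
  unique(as.character(unlist(found)))
}

# Usage with a small made-up body; the tiny chunk size forces identifiers
# to be split across chunk boundaries.
body <- charToRaw(paste0(
  '{"results":{"bindings":[',
  '{"item":{"value":"http://www.wikidata.org/entity/Q42"}},',
  '{"item":{"value":"http://www.wikidata.org/entity/Q13442814"}},',
  '{"item":{"value":"http://www.wikidata.org/entity/Q42"}}]}}'
))
ids <- extract_item_ids(body, chunk_size = 7)
print(ids)
```

The chunk size trades memory for the number of regex passes; collecting matches in a list and deduplicating once at the end avoids repeatedly copying a growing character vector.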

Closing; further optimization issues to be tracked under general WDCM tickets.