- The WDCM (S)itelinks engine runs on stat1005 are quite heavy;
- Some critical steps (WDQS queries for subclasses/superclasses, MediaWiki API calls for large collections of item labels) are error-prone and should be made more robust;
- Eliminate all possible confusion in the data model design (item/class distinction must be made explicit in modeling).
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | GoranSMilovanovic | T187396 WDCM for (S)itelinks only
Resolved | | GoranSMilovanovic | T203234 Optimize WDCM (S)itelinks engine runs
Event Timeline
Comment Actions
- This is probably solved by a combination of a BFS run from SERVICE gas:service and a simple SPARQL binding that avoids fetching empty classes.
- Testing now.
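A minimal sketch of what such a query could look like, built as a string from R. The function name `build_bfs_query`, its arguments, and the use of `FILTER EXISTS` to skip classes without instances are assumptions for illustration; the thread does not show the actual query or binding used by the engine.

```r
# Hypothetical helper: builds a WDQS query that walks the subclass tree
# with the Blazegraph GAS BFS program and keeps only non-empty classes.
build_bfs_query <- function(class_id, link_property = "P279") {
  # Guard against malformed identifiers before they reach the query text
  stopifnot(grepl("^Q[0-9]+$", class_id), grepl("^P[0-9]+$", link_property))
  paste(
    "SELECT DISTINCT ?subClass ?depth WHERE {",
    "  SERVICE gas:service {",
    '    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;',
    sprintf("                gas:in wd:%s ;", class_id),
    sprintf("                gas:linkType wdt:%s ;", link_property),
    '                gas:traversalDirection "Reverse" ;',
    "                gas:out ?subClass ;",
    "                gas:out1 ?depth .",
    "  }",
    "  # Assumption: an 'empty' class is one with no P31 instances",
    "  FILTER EXISTS { ?item wdt:P31 ?subClass }",
    "}",
    sep = "\n"
  )
}

# Example: subclasses of Q13442814 (scholarly article)
query <- build_bfs_query("Q13442814")
cat(query)
```

Filtering inside the query keeps empty classes out of the response, so they never have to be transferred or parsed on the client side.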
Comment Actions
- From one problem into another; in the category of Scientific Articles:
rc <- rawToChar(res$content)
Error in rawToChar(res$content) : long vectors not supported yet: raw.c:68
rc <- rawToChar(res$content, multiple = TRUE)
Error: cannot allocate vector of size 19.3 Gb
Comment Actions
- Ok, resolved by a fall-back to a "less greedy mode" (I've tried to fetch items, their classes, the classes of their classes, and depth from the origin...)
- JSON parsing is abandoned; a simple regex now extracts the items only, and it seems we're back in the game.
- Still testing. For reasons described in a recent e-mail in relation to T203165, I will wait for the cluster reboot before starting another main engine run.
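One way the regex approach can sidestep the `rawToChar` limits shown earlier is to convert the raw response in chunks and match item URIs in each chunk. The sketch below is an assumption about how this could be done, not the engine's actual code: `extract_items`, `chunk_size`, and `overlap` are hypothetical names, and the pattern assumes that items appear as `.../entity/Q123"` in the response.

```r
# Hypothetical helper: pull item IDs out of a raw response body chunk by chunk,
# so that no single character string has to hold the whole payload.
extract_items <- function(raw_body, chunk_size = 1e6, overlap = 64) {
  # The closing quote is required, so an ID cut off at a chunk end never matches
  pattern <- 'entity/Q[0-9]+"'
  n <- length(raw_body)
  items <- character(0)
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # Re-read a few bytes before the chunk to catch IDs spanning a boundary
    from <- max(1, start - overlap)
    txt <- rawToChar(raw_body[from:end])
    # useBytes = TRUE avoids errors if a multibyte character is split
    hits <- regmatches(txt, gregexpr(pattern, txt, useBytes = TRUE))[[1]]
    items <- c(items, gsub('entity/|"', "", hits))
    start <- end + 1
  }
  # Matches inside the overlap are seen twice, so deduplicate
  unique(items)
}

# Small demonstration with a deliberately tiny chunk size
body <- charToRaw(paste0(
  '{"item":{"value":"http://www.wikidata.org/entity/Q42"}},',
  '{"item":{"value":"http://www.wikidata.org/entity/Q13442814"}}'
))
print(extract_items(body, chunk_size = 10))
```

The overlap must be at least as long as the longest possible match; 64 bytes is a generous guess for item URIs of this shape.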
Comment Actions
- With Map-Reduce, this update runs smoothly.
- If I can fix T200609, we will most probably switch to Spark here.
Comment Actions
- Final adjustments in the projects selection criteria under way;
- the engine will run from stat1007 crontab.
Comment Actions
- One engine update run takes approx. 8h;
- putting it on a weekly update schedule now;
- even if the ETL phase can be further optimized, the ML phase is still critical with {maptpx};
- it is questionable whether Spark MLlib, MALLET, or anything else, can help in that respect.
Closing; further optimization issues to be tracked under general WDCM tickets.