
Create a parallel loader to improve load performance for WDQS / Blazegraph
Closed, Declined · Public

Event Timeline

Restricted Application added a subscriber: Aklapper.

Performance measured on dump from 20191202: https://dumps.wikimedia.org/wikidatawiki/entities/20191202/
Baseline time to load: 4264m29.914s, 714218864640 bytes

Improvements proposed:

  1. One-path loading: data is loaded into the SPO index only, and the POS and OSP indices are recreated in parallel afterwards.

One-path time to load: 1755m57.082s (41.2% of baseline), 402815582208 bytes (56.4% of baseline)
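The percentages can be rechecked from the raw figures. A minimal sketch, assuming the durations are in the `NmS.SSSs` format printed by the shell `time` builtin; the helper name `parse_duration` is made up for this illustration:

```python
import re


def parse_duration(text: str) -> float:
    """Convert a duration such as '4264m29.914s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# Figures copied from the measurements above.
baseline_seconds = parse_duration("4264m29.914s")
one_path_seconds = parse_duration("1755m57.082s")
baseline_bytes = 714218864640
one_path_bytes = 402815582208

time_ratio = one_path_seconds / baseline_seconds
size_ratio = one_path_bytes / baseline_bytes

print(f"time: {time_ratio:.1%} of baseline")
print(f"size: {size_ratio:.1%} of baseline")
```

Both ratios round to the reported 41.2% and 56.4%.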
Indices recreation: In progress.
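The idea behind the one-path approach can be sketched as follows. This is only an illustration of the shape of the process, not Blazegraph code: the triples are invented, the "indices" are plain sorted lists standing in for B+trees, and `build_index` is a hypothetical helper. Python threads also do not give real CPU parallelism here; the point is that POS and OSP depend only on the finished SPO index, so the two rebuilds are independent of each other.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up (subject, predicate, object) triples for illustration only.
triples = [
    ("s2", "p1", "o3"),
    ("s1", "p2", "o1"),
    ("s1", "p1", "o2"),
]


def build_index(rows, key_order):
    """Re-key each row by key_order and sort, mimicking an index's key order."""
    return sorted(tuple(row[i] for i in key_order) for row in rows)


# Step 1: the load itself writes the SPO index only.
spo = build_index(triples, (0, 1, 2))

# Step 2: POS and OSP are derived from SPO afterwards, as two independent jobs.
with ThreadPoolExecutor(max_workers=2) as pool:
    pos_future = pool.submit(build_index, spo, (1, 2, 0))
    osp_future = pool.submit(build_index, spo, (2, 0, 1))
    pos = pos_future.result()
    osp = osp_future.result()
```

Under this scheme the load phase performs one index write per statement instead of three, which is consistent with the shorter load time and smaller size reported above; the cost of the two rebuilds still has to be added once they are measured.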

  2. Data to be loaded is parsed in parallel into StatementBuffer instances, which are then queued for loading into the DB.

To be done.
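Since this part was not implemented, the following is only a sketch of how parallel parsing with a load queue might be structured. It does not use the Blazegraph API: a plain list of tuples stands in for a StatementBuffer, a list stands in for the DB, and every name (`parse_chunk`, `writer`, `parallel_load`, `BUFFER_CAPACITY`) is hypothetical. The input format (whitespace-separated `S P O` lines) is also an assumption made to keep the example short.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

BUFFER_CAPACITY = 2  # statements per buffer; deliberately tiny for the demo
SENTINEL = None      # tells the writer that no more buffers will arrive


def parse_chunk(lines):
    """Parse 'S P O' lines into buffers of at most BUFFER_CAPACITY triples."""
    statements = [tuple(line.split()) for line in lines if line.strip()]
    return [
        statements[i:i + BUFFER_CAPACITY]
        for i in range(0, len(statements), BUFFER_CAPACITY)
    ]


def writer(buffers, store):
    """Single consumer: drain queued buffers into the store one at a time."""
    while True:
        buffer = buffers.get()
        if buffer is SENTINEL:
            break
        store.extend(buffer)


def parallel_load(chunks, parser_threads=4):
    buffers = queue.Queue(maxsize=8)  # bounded, so parsers cannot outrun the writer
    store = []
    writer_thread = threading.Thread(target=writer, args=(buffers, store))
    writer_thread.start()

    def parse_and_enqueue(chunk):
        for buffer in parse_chunk(chunk):
            buffers.put(buffer)

    try:
        with ThreadPoolExecutor(max_workers=parser_threads) as pool:
            # list() forces evaluation so parser exceptions surface here.
            list(pool.map(parse_and_enqueue, chunks))
    finally:
        buffers.put(SENTINEL)  # always release the writer, even on failure
        writer_thread.join()
    return store


chunks = [["s1 p1 o1", "s2 p1 o2", "s3 p2 o3"], ["s4 p1 o4"]]
loaded = parallel_load(chunks)
```

Keeping a single writer preserves one-at-a-time writes to the store while parsing scales across threads. The order in which buffers arrive is not deterministic, so anything that depends on load order would need extra care.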

@Aklapper, thank you! Fixed the commit message.

That's going to be part of the upcoming Hadoop pipeline.