We want to inline value and reference URIs, possibly using some Blazegraph extensions and the same techniques that are used for IPs or UUIDs.
Status | Assigned | Task
---|---|---
Resolved | Smalyshev | T213210 WDQS is hitting allocator limit on Blazegraph
Resolved | Igorkim78 | T235759 [TRACKING] WDQS / Blazegraph optimization / bug fixes
Resolved | Igorkim78 | T213375 Inline value and reference URIs
Event Timeline
Change 505642 had a related patch set uploaded (by Igor Kim; owner: Igor Kim):
[wikidata/query/blazegraph@master] Inline value and reference URIs handler
Changeset created to support reference URIs inlining:
https://gerrit.wikimedia.org/r/#/c/wikidata/query/blazegraph/+/505642
Baseline collected for performance test:
Data files loaded: 100 ttl gz files into an empty journal
Total size of ttl.gz files: 7.9GB
Number of triples in the journal: 1,114,293,494
Size of the journal: 116,122,910,720 bytes (about 108 GiB)
Subjects in the journal: 135,711,041
Reference subjects in the journal: 7,206,942 (5.3% of all subjects)
Allocators count: 35,557
Slots allocated: 494,046,565
Load performance:
Load time: 156278 seconds (43 hours)
Load performance Average: 7122 mutations per second
Load performance Stabilized (last 10 files): 4170 mutations per second
Query performance was measured for the simple query select * {?s ?p ?o } with ?s bound to a random subject drawn from one of two sets: reference URIs, and all other URIs except statement URIs.
For reference URIs:
Stabilized query performance after ~150K queries: 80 qps with average of 4 rows per result set
Normalized query performance: 320 rows per second.
For other URIs:
Stabilized query performance after ~150K queries: 70 qps with average of 5 rows per result set
Normalized query performance: 350 rows per second.
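The normalization above is just queries per second multiplied by the average result-set size. A minimal sketch of that arithmetic, where the function name `rows_per_second` is only illustrative:

```python
def rows_per_second(queries_per_second: float, avg_rows_per_result: float) -> float:
    """Normalize query throughput by the average number of rows returned."""
    return queries_per_second * avg_rows_per_result


# Stabilized figures reported above for the baseline journal.
reference_rate = rows_per_second(80, 4)  # ?s bound to reference URIs
other_rate = rows_per_second(70, 5)      # ?s bound to other non-statement URIs

print("reference URIs:", reference_rate, "rows/s")
print("other URIs:", other_rate, "rows/s")
```

Normalizing this way makes the two sets comparable even though their result sets differ in size.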
In progress: reload journal with configurations:
- reference URIs inlining,
- reference URIs inlining, raw records disabled per T213210
- reference URIs inlining, raw records disabled, INLINE_TEXT_LITERALS for short strings per T213210
and compare results with the baseline.
Change 506045 had a related patch set uploaded (by Igor Kim; owner: Igor Kim):
[wikidata/query/rdf@master] Apply InlineFixedWidthHexIntegerURIHandler for reference URIs
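For readers unfamiliar with the idea, the following is a rough sketch of what fixed-width hex inlining amounts to conceptually: the hash-like local name of a URI is stored as an integer, and the fixed width is what allows leading zeros to be restored. This is not the Blazegraph handler's code. The 40-digit width and the function names are assumptions made for illustration only.

```python
REFERENCE_PREFIX = "http://www.wikidata.org/reference/"
HASH_WIDTH = 40  # assumed width of the hex local name, for illustration only
HEX_DIGITS = set("0123456789abcdef")


def inline_uri(uri, prefix=REFERENCE_PREFIX, width=HASH_WIDTH):
    """Return the local name as an integer, or None if the URI cannot be inlined."""
    if not uri.startswith(prefix):
        return None
    local = uri[len(prefix):]
    # Only exact-width, lowercase hex names round-trip without loss.
    if len(local) != width or not set(local) <= HEX_DIGITS:
        return None
    return int(local, 16)


def restore_uri(value, prefix=REFERENCE_PREFIX, width=HASH_WIDTH):
    """Rebuild the URI, zero-padding the hex digits back to the fixed width."""
    return prefix + format(value, "0{}x".format(width))


sample = REFERENCE_PREFIX + "00" + "ab" * 19  # 40 hex digits with leading zeros
assert restore_uri(inline_uri(sample)) == sample
```

URIs that do not match the prefix or width simply fall through (`None` here) and would be stored the ordinary way.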
Attached are the results of loading 100 ttl.gz files with different configurations:
- original baseline (commit blazegraph 895a4f3bd003ddb4b1f31257f642ca3616bca79b, rdf 4245b2a5bc0c7d4b369a43ba512b5e537dac07a4)
- reference URIs inlining,
- reference URIs inlining, raw records disabled per T213210
- reference URIs inlining, raw records disabled, INLINE_TEXT_LITERALS for short strings per T213210
Conclusions, comparing to original baseline:
- Inlining of reference and value URIs takes 22% more time and produces a journal 10% larger in bytes, with 1.7% fewer allocations whose overall size is 8.6% larger, though there are 58% more blob allocations with 66% more size.
- Inlining of reference and value URIs with raw records disabled takes 20% more time and produces a journal of the same size, with 77% fewer allocations whose overall size is 29% smaller, though there are 61% more blob allocations with 73% more size.
- Inlining of reference and value URIs and literals (shorter than 40 characters) with raw records disabled takes 66% more time and produces a journal 21% larger in bytes, with 75% fewer allocations whose overall size is 2% smaller, though there are 234% more blob allocations with 382% more size.
Result:
The configuration with inlining of reference and value URIs and raw records disabled might be considered to reduce the allocation count, but all tested configurations result in more allocations for BLOBs.
Load performance for the tested configurations on isolated environment (i7-7700HQ, 8 cores 2.8GHz, 32GB RAM, SSD Samsung 960 PRO)
Query performance on simple queries (select * {?s ?p ?o .} with ?s bound to a random subject URI) does not show any significant change across the tested journal configurations.
A more complex query mix should probably be applied to the journals to see a difference.
Change 505642 abandoned by Smalyshev:
Inline value and reference URIs handler
Reason:
superseded by https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/506045
Configuration options are set in RWStore.properties. The relevant options are:
- Inlined Value and Reference URIs:
com.bigdata.rdf.store.AbstractTripleStore.inlineURIFactory=org.wikidata.query.rdf.blazegraph.WikibaseInlineUriFactory$V001
- Raw records support disabled:
com.bigdata.rdf.store.AbstractTripleStore.enableRawRecordsSupport=false
- Inlining of short text literals; a maximum length has to be assigned as the threshold for inlining literals:
com.bigdata.rdf.store.AbstractTripleStore.inlineTextLiterals=true
com.bigdata.rdf.store.AbstractTripleStore.maxInlineTextLength=40
The options can be combined. The most promising combination is inlining of value and reference URIs together with disabling raw records support:
com.bigdata.rdf.store.AbstractTripleStore.inlineURIFactory=org.wikidata.query.rdf.blazegraph.WikibaseInlineUriFactory$V001
com.bigdata.rdf.store.AbstractTripleStore.enableRawRecordsSupport=false
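As a quick sanity check, one could verify that a given RWStore.properties actually carries this combination. The sketch below uses a deliberately simplified parser (plain `key=value` lines only, no line continuations or `:` separators). The helper names are hypothetical, and only the two property keys and their values come from the configuration above.

```python
PREFIX = "com.bigdata.rdf.store.AbstractTripleStore."
EXPECTED = {
    PREFIX + "inlineURIFactory":
        "org.wikidata.query.rdf.blazegraph.WikibaseInlineUriFactory$V001",
    PREFIX + "enableRawRecordsSupport": "false",
}


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_or_different(props, expected=EXPECTED):
    """Return the expected keys that are absent or set to another value."""
    return sorted(k for k, v in expected.items() if props.get(k) != v)


example = "\n".join("{}={}".format(k, v) for k, v in EXPECTED.items())
print(missing_or_different(parse_properties(example)))
```

An empty list means both options are set as described. Any key returned is either missing from the file or has a different value.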
An additional configuration was tested with only raw records disabled. Compared to the original baseline, it:
- takes 1.7% more time and produces a journal 9.2% smaller in bytes, with 77% fewer allocations whose overall size is 38.9% smaller, though there are 2.9% more blob allocations with 7.5% more size.
This might be an option to consider, even without value and reference URIs inlining.
Change 506045 merged by jenkins-bot:
[wikidata/query/rdf@master] Inline value and reference URIs without core blazegraph changes
Change 516587 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Disable raw records in config
Change 516587 merged by jenkins-bot:
[wikidata/query/rdf@master] Disable raw records in config