Inline value and reference URIs
Closed, Resolved, Public

Description

We want to inline value and reference URIs, possibly using Blazegraph extensions and the same techniques that are already used for IPs and UUIDs.
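
For illustration, the idea behind inlining a URI with a fixed-width hex local name can be sketched as follows. This is only a conceptual sketch, not the Blazegraph handler itself: the prefix, the 40-digit width, and the function names `inline_uri` and `materialize_uri` are assumptions made for the example. The point is that the URI can be stored as a number and rebuilt exactly, as long as the width is fixed so that leading zeros are restored.

```python
# Assumptions (not taken from the patch): the prefix and the 40-digit width.
REFERENCE_PREFIX = "http://www.wikidata.org/reference/"
HEX_WIDTH = 40
HEX_DIGITS = set("0123456789abcdef")


def inline_uri(uri, prefix=REFERENCE_PREFIX, width=HEX_WIDTH):
    """Return the local name as an integer, or None if the URI cannot be inlined."""
    if not uri.startswith(prefix):
        return None
    local = uri[len(prefix):]
    # Only lowercase hex of exactly the expected width round-trips safely.
    if len(local) != width or not set(local) <= HEX_DIGITS:
        return None
    return int(local, 16)


def materialize_uri(value, prefix=REFERENCE_PREFIX, width=HEX_WIDTH):
    """Rebuild the URI; zero padding restores leading zeros lost in the integer."""
    return prefix + format(value, "0%dx" % width)


# Hypothetical reference URI with leading zeros in the local name.
uri = REFERENCE_PREFIX + "00a3" + "f" * 36
assert materialize_uri(inline_uri(uri)) == uri
```

URIs that do not match the prefix or the width are not inlined and would go through the normal dictionary lookup.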

Event Timeline

Smalyshev triaged this task as High priority. (Jan 10 2019, 5:57 AM)
Smalyshev lowered the priority of this task from High to Medium.
Smalyshev created this task.

Change 505642 had a related patch set uploaded (by Igor Kim; owner: Igor Kim):
[wikidata/query/blazegraph@master] Inline value and reference URIs handler

https://gerrit.wikimedia.org/r/505642

Igorkim78 claimed this task. (Edited Apr 22 2019, 4:55 PM)
Igorkim78 added a subscriber: Igorkim78.

Changeset created to support reference URIs inlining:
https://gerrit.wikimedia.org/r/#/c/wikidata/query/blazegraph/+/505642

Baseline collected for performance test:
Data files loaded: 100 ttl.gz files into an empty journal
Total size of ttl.gz files: 7.9GB
Number of triples in the journal: 1,114,293,494
Size of the journal: 116,122,910,720 bytes (about 116 GB, or 108 GiB)
Subjects in the journal: 135,711,041
Reference subjects in the journal: 7,206,942 (5.3% of all subjects)

Allocators count: 35,557
Slots allocated: 494,046,565

Load performance:
Load time: 156278 seconds (43 hours)
Load performance Average: 7122 mutations per second
Load performance Stabilized (last 10 files): 4170 mutations per second
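
As a cross-check of the baseline figures, the derived values can be recomputed from the raw numbers above. This is just arithmetic on the reported values; the variable names are made up for the sketch. Note that triples per second is not necessarily the same thing as the reported mutations per second, so a small difference from the 7122 figure is expected.

```python
# Raw baseline numbers as reported in this comment.
triples = 1_114_293_494
journal_bytes = 116_122_910_720
subjects = 135_711_041
reference_subjects = 7_206_942
load_seconds = 156_278

load_hours = load_seconds / 3600                        # reported as 43 hours
journal_gib = journal_bytes / 2**30                     # size in GiB
reference_share = 100 * reference_subjects / subjects   # reported as 5.3%
bytes_per_triple = journal_bytes / triples
triples_per_second = triples / load_seconds             # compare with avg mutations/s

print("load time: %.1f h" % load_hours)
print("journal size: %.1f GiB" % journal_gib)
print("reference subjects: %.1f%% of all subjects" % reference_share)
print("journal bytes per triple: %.1f" % bytes_per_triple)
print("triples per second: %.0f" % triples_per_second)
```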

Query performance was measured with the simple query select * {?s ?p ?o }, with ?s bound to a random subject drawn from one of two sets: reference URIs, and all other URIs except statement URIs.
For reference URIs:
Stabilized query performance after ~150K queries: 80 qps with average of 4 rows per result set
Normalized query performance: 320 rows per second (80 qps × 4 rows).
For other URIs:
Stabilized query performance after ~150K queries: 70 qps with average of 5 rows per result set
Normalized query performance: 350 rows per second (70 qps × 5 rows).
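
The measurement above could be reproduced with a small harness along these lines. This is a sketch, not the script that was actually used: `run_query` is a hypothetical callable that sends a SPARQL string to the endpoint and returns the list of result rows, and `subjects` is a pre-collected list of subject URIs for one of the two sets.

```python
import random
import time


def benchmark(run_query, subjects, n_queries, clock=time.perf_counter):
    """Bind ?s to random subjects and return (qps, avg_rows, rows_per_second)."""
    total_rows = 0
    start = clock()
    for _ in range(n_queries):
        subject = random.choice(subjects)
        # VALUES binds ?s to the chosen subject for the simple pattern.
        query = "SELECT * { VALUES ?s { <%s> } ?s ?p ?o }" % subject
        total_rows += len(run_query(query))
    elapsed = clock() - start
    qps = n_queries / elapsed
    avg_rows = total_rows / n_queries
    # Normalized performance: queries per second times rows per result set.
    return qps, avg_rows, qps * avg_rows
```

Normalizing by rows per result set makes the two subject sets comparable even though their average result sizes differ.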

In progress: reloading the journal with the following configurations:

  • reference URIs inlining
  • reference URIs inlining, raw records disabled per T213210
  • reference URIs inlining, raw records disabled, INLINE_TEXT_LITERALS for short strings per T213210

The results will be compared with the baseline.

Change 506045 had a related patch set uploaded (by Igor Kim; owner: Igor Kim):
[wikidata/query/rdf@master] Apply InlineFixedWidthHexIntegerURIHandler for reference URIs

https://gerrit.wikimedia.org/r/506045

Attached: results of loading the 100 ttl.gz files with different configurations.

Conclusions, compared to the original baseline:

  • Inlining of reference and value URIs takes 22% more time and produces a journal 10% larger in bytes, with 1.7% fewer allocations whose overall size is 8.6% larger; however, there are 58% more blob allocations, with 66% more size.
  • Inlining of reference and value URIs with raw records disabled takes 20% more time and produces a journal of the same size, with 77% fewer allocations whose overall size is 29% smaller; however, there are 61% more blob allocations, with 73% more size.
  • Inlining of reference and value URIs and of literals (shorter than 40 chars) with raw records disabled takes 66% more time and produces a journal 21% larger in bytes, with 75% fewer allocations whose overall size is 2% smaller; however, there are 234% more blob allocations, with 382% more size.

Result:
The configuration that inlines reference and value URIs with raw records disabled might be considered in order to reduce the allocation count, but all tested configurations result in more allocations for BLOBs.
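
To put the percentages into absolute terms, the implied load times can be derived from the baseline. This is a sketch that assumes the baseline load time of 156278 seconds reported earlier in this task is the one the percentages refer to; the results are derived from the rounded percentages, not measured values (those are in the attached logs). The helper names are made up for the example.

```python
BASELINE_LOAD_SECONDS = 156_278  # baseline load time reported earlier in this task


def pct_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


def implied_value(baseline, pct):
    """Value implied by a reported percentage change against the baseline."""
    return baseline * (1 + pct / 100.0)


# Reported load-time increases for the three tested configurations.
for label, pct in [
    ("URI inlining", 22),
    ("URI inlining, raw records disabled", 20),
    ("URI and literal inlining, raw records disabled", 66),
]:
    hours = implied_value(BASELINE_LOAD_SECONDS, pct) / 3600
    print("%s: about %.0f hours implied" % (label, hours))
```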

Load performance for the tested configurations was measured on an isolated environment (i7-7700HQ, 8 logical cores at 2.8 GHz, 32 GB RAM, Samsung 960 PRO SSD).

Query performance on simple queries (select * {?s ?p ?o .} with ?s bound to a random subject URI) does not show any significant change across the tested journal configurations.
A more complex query mix should probably be run against the journals to see a difference.

Complete test logs attached

Queries for tests:

SELECT ?item WHERE {
  ?item wdt:P31 wd:Q11879590 .
  FILTER NOT EXISTS {
    ?item wdt:P31 wd:Q4167410
  }
  OPTIONAL {
    ?item schema:description ?des .
    FILTER ((LANG(?des)) = "ar")
  }
  FILTER (!BOUND(?des))
}
LIMIT 5000
================
SELECT ?l ?lemma ?posLabel WHERE {
  ?l a ontolex:LexicalEntry ;
    dct:language ?language ;
    wikibase:lemma ?lemma .
  ?language wdt:P424 'fr' .
  OPTIONAL {
    ?l wikibase:lexicalCategory ?pos .
    SERVICE wikibase:label {
      bd:serviceParam wikibase:language "en" .
    }
  }
  FILTER NOT EXISTS {
    ?l ontolex:sense ?sense
  }
}
ORDER BY ?lemma
================
SELECT ?ville ?b WHERE {
  ?ville wdt:P31 wd:Q515 .
  ?ville rdfs:label ?b .
  FILTER (?b = "Paris"@fr)
}
LIMIT 100
================
SELECT DISTINCT ?city ?cityLabel ?location ?populatie2 WHERE {
  wd:Q9832 wdt:P1082 ?populatie .
  ?city wdt:P1082 ?populatie2 ;
    wdt:P625 ?location .
  FILTER (ABS(?populatie - ?populatie2) < 1000)
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en,nl"
  }
}
====================
PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT DISTINCT ?property ?label {
  {
    SELECT ?property ?label WHERE {
      ?property a wikibase:Property ;
        rdfs:label ?label
      FILTER (LANG(?label) = "en") .
      FILTER (CONTAINS(LCASE(?label), LCASE("software")))
    }
  }
  UNION {
    SELECT ?property ?label WHERE {
      [ rdfs:label ?ilabel ] wdt:P1963 ?property .
      ?property rdfs:label ?label
      FILTER (LANG(?label) = "en") .
      FILTER (LANG(?ilabel) = "en" && CONTAINS(LCASE(?ilabel), LCASE("software")))
    }
  }
  UNION {
    SELECT DISTINCT ?property ?label WHERE {
      ?property a wikibase:Property ;
        wdt:P31 [ rdfs:label ?ilabel ] ;
        rdfs:label ?label
      FILTER (LANG(?label) = "en") .
      FILTER (LANG(?ilabel) = "en" && CONTAINS(LCASE(?ilabel), LCASE("software")))
    }
  }
}
ORDER BY ?label
================
SELECT ?item ?itemLabel ?dod ?fecha_de_nacimiento ?sexo_o_g_nero ?pa_s_de_nacionalidad ?modified WHERE {
  ?item wdt:P570 ?dod .
  ?item wdt:P31 wd:Q5 .
  ?item schema:dateModified ?modified
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "es,en,it,fr,de,cs" .
  }
  OPTIONAL {
    ?item wdt:P569 ?fecha_de_nacimiento .
  }
  OPTIONAL {
    ?item wdt:P21 ?sexo_o_g_nero .
  }
  FILTER (?dod > "2017-12-31T00:00:00Z"^^xsd:dateTime)
  FILTER (?dod < (NOW ()))
  FILTER (?modified > "2019-03-05T00:00:00Z"^^xsd:dateTime)
  OPTIONAL {
    ?item wdt:P27 ?pa_s_de_nacionalidad .
  }
}
ORDER BY DESC(?modified) DESC(?dod) ?item
====================
SELECT ?botanist ?abbrev ?botanistLabel ?articleEn ?articleFr WHERE {
  ?botanist wdt:P428 ?abbrev .
  OPTIONAL {
    ?articleEn schema:about ?botanist .
    FILTER (SUBSTR(STR(?articleEn), 1, 25) = "https://en.wikipedia.org/")
  }
  OPTIONAL {
    ?articleFr schema:about ?botanist .
    FILTER (SUBSTR(STR(?articleFr), 1, 25) = "https://fr.wikipedia.org/")
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "fr,en,en" .
  }
}
====================
SELECT DISTINCT ?entity ?wikiUrl ?gender ?birth_date ?death_date ?birth_place ?birth_country_code WHERE {
  ?wikiUrl schema:about ?entity ;
    rdf:type schema:Article ;
    schema:isPartOf/wikibase:wikiGroup "wikipedia" ;
    schema:name "Dav Pilkey"@en .
  ?entity wdt:P460*/wdt:P31/wdt:P279* wd:Q5 .
  OPTIONAL {
    ?entity wdt:P569 ?birth_date
  }
  OPTIONAL {
    ?entity wdt:P570 ?death_date
  }
  OPTIONAL {
    ?entity wdt:P21 ?gender_entity .
    BIND (IF(?gender_entity = wd:Q6581097, "M", IF(?gender_entity = wd:Q6581072, "F", "")) AS ?gender)
  }
  OPTIONAL {
    ?entity wdt:P19/rdfs:label ?birth_place .
    FILTER (LANG(?birth_place) = 'en')
  }
  OPTIONAL {
    ?entity wdt:P19*/wdt:P17/wdt:P297 ?birth_country_code
  }
}
====================
SELECT DISTINCT ?species ?taxid ?gene ?gene_id WHERE {
  VALUES ?taxid {
    "9606"
  }
  VALUES ?gene_id {
    "A1CF"@en "A1CF"
  }
  ?gene rdfs:label|wdt:P2393 ?gene_id .
  {
    ?gene wdt:P703 ?species .
  }
  UNION {
    ?gene wdt:P703 ?species2 .
    ?species2 wdt:P460 ?species .
  }
  ?species wdt:P685 ?taxid .
}
====================
SELECT DISTINCT ?item {
  ?item (wdt:P570) ?time0 .
  FILTER (?time0 >= "365-01-01T00:00:00Z"^^xsd:dateTime && ?time0 <= "365-12-31T23:59:59Z"^^xsd:dateTime)
}
====================
SELECT ?entityLabel ?date ?url WHERE {
  ?entity wdt:P569 ?date .
  ?entity wdt:P485 wd:Q814779 .
  ?entity p:P485 ?statement .
  ?statement prov:wasDerivedFrom ?ref .
  ?ref pr:P854 ?url .
  FILTER REGEX(STR(?url), "hdl.handle.net/10079/fa/beinecke")
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
ORDER BY MONTH(?date) DAY(?date)
=====================
SELECT ?city ?countrycode ?population ?cityLabel ?language ?lat ?long WHERE {
  ?city wdt:P17 ?country .
  ?country wdt:P297 ?countrycode .
  ?city rdfs:label ?citysearchLabel .
  ?city p:P625 ?statement .
  OPTIONAL {
    ?city wdt:P1082 ?population .
  }
  ?statement psv:P625 ?coordinate_node .
  ?coordinate_node wikibase:geoLatitude ?lat .
  ?coordinate_node wikibase:geoLongitude ?long .
  ?city rdfs:label ?cityLabel .
  BIND (LANG(?cityLabel) AS ?language)
  FILTER (REGEX(?countrycode, UCASE("co")))
  FILTER ((LANG(?citysearchLabel)) = "en")
  FILTER (REGEX(?citysearchLabel, "^Acacías$"))
  FILTER ((ABS(?lat - "3.98749995231628"^^xsd:decimal)) < "0.5"^^xsd:decimal)
  FILTER ((ABS(?long - "-73.7566680908203"^^xsd:decimal)) < "0.5"^^xsd:decimal)
}

I will probably add more but these should be good for now.
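
To feed the list above into a test run, the paste can be split into individual queries. This sketch assumes the queries are saved to a plain text file in which separator lines consist only of ten or more `=` characters; the function name `split_queries` is made up for the example.

```python
import re

# A separator is a line made only of '=' characters (ten or more).
SEPARATOR = re.compile(r"^={10,}[ \t]*$", re.MULTILINE)


def split_queries(text):
    """Split a paste of SPARQL queries on separator lines, dropping empty chunks."""
    return [chunk.strip() for chunk in SEPARATOR.split(text) if chunk.strip()]


sample = "SELECT ?a { ?a ?b ?c }\n================\nSELECT ?x { ?x ?y ?z }\n"
for number, query in enumerate(split_queries(sample), start=1):
    print(number, query.splitlines()[0])
```

The queries rely on the standard WDQS prefixes (wd:, wdt:, p:, and so on), so they need either an endpoint that predefines them or explicit PREFIX declarations.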

Change 505642 abandoned by Smalyshev:
Inline value and reference URIs handler

Reason:
superseded by https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/506045

https://gerrit.wikimedia.org/r/505642

Configuration options are set in RWStore.properties. The relevant options are:

  • Inlined Value and Reference URIs:

com.bigdata.rdf.store.AbstractTripleStore.inlineURIFactory=org.wikidata.query.rdf.blazegraph.WikibaseInlineUriFactory$V001

  • Raw records support disabled:

com.bigdata.rdf.store.AbstractTripleStore.enableRawRecordsSupport=false

  • Inlining of short text literals; a maximum length must be assigned as the threshold for inlining literals:

com.bigdata.rdf.store.AbstractTripleStore.inlineTextLiterals=true
com.bigdata.rdf.store.AbstractTripleStore.maxInlineTextLength=40

These parameters can be combined. The most promising combination is inlining of value and reference URIs together with disabling raw records support:

com.bigdata.rdf.store.AbstractTripleStore.inlineURIFactory=org.wikidata.query.rdf.blazegraph.WikibaseInlineUriFactory$V001
com.bigdata.rdf.store.AbstractTripleStore.enableRawRecordsSupport=false
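
For test setups that generate several journal configurations, the overrides above can be applied to the text of a properties file with a small helper. This is a sketch only: `apply_overrides` is a made-up name, the sample input is hypothetical, and the helper handles plain `key=value` lines rather than the full Java properties syntax.

```python
def key_of(line):
    """Return the key of a plain key=value line, or None for comments and blanks."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#") or "=" not in stripped:
        return None
    return stripped.split("=", 1)[0].strip()


def apply_overrides(properties_text, overrides):
    """Replace existing assignments for the given keys and append missing ones."""
    remaining = dict(overrides)
    out = []
    for line in properties_text.splitlines():
        key = key_of(line)
        if key in remaining:
            out.append("%s=%s" % (key, remaining.pop(key)))
        else:
            out.append(line)
    # Keys that were not present in the file are appended at the end.
    out.extend("%s=%s" % (key, value) for key, value in remaining.items())
    return "\n".join(out) + "\n"


PREFIX = "com.bigdata.rdf.store.AbstractTripleStore."
overrides = {
    PREFIX + "inlineURIFactory":
        "org.wikidata.query.rdf.blazegraph.WikibaseInlineUriFactory$V001",
    PREFIX + "enableRawRecordsSupport": "false",
}
# Hypothetical existing file content, used only to show the replacement.
sample = "# journal settings\n" + PREFIX + "enableRawRecordsSupport=true\n"
print(apply_overrides(sample, overrides))
```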

An additional configuration with only raw records disabled was also tested. Compared to the original baseline, it:

  • takes 1.7% more time and produces a journal 9.2% smaller in bytes, with 77% fewer allocations whose overall size is 38.9% smaller; however, there are 2.9% more blob allocations, with 7.5% more size.

This might be an option to consider, even without value and reference URIs inlining.

Change 506045 merged by jenkins-bot:
[wikidata/query/rdf@master] Inline value and reference URIs without core blazegraph changes

https://gerrit.wikimedia.org/r/506045

Change 516587 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[wikidata/query/rdf@master] Disable raw records in config

https://gerrit.wikimedia.org/r/516587

Change 516587 merged by jenkins-bot:
[wikidata/query/rdf@master] Disable raw records in config

https://gerrit.wikimedia.org/r/516587

Addshore moved this task from incoming to in progress on the Wikidata board. (Jun 21 2019, 11:25 PM)
Smalyshev closed this task as Resolved. (Jul 8 2019, 10:39 PM)