
Open questions for Blazegraph data model research
Closed, Resolved (Public)

Description

I'm creating this task to record open questions/issues that we have for the Blazegraph implementation, at the data-model research stage. I'll update these periodically. If anybody thinks a better form/forum is preferable, please tell me.

  1. RDR syntax. This produces an error:
<<entity:Q16 v:P1451 "A mari usque ad mare"@la>>
 wikibase:Rank wikibase:PreferredRank ;
 q:P805 entity:Q41423 .

The error message is:

Caused by: java.lang.RuntimeException: Could not load: url=file:///home/smalyshev/dump-3m.fixed.ttlx.gz, cause=org.openrdf.rio.RDFParseException: Illegal language tag char: '>' [line 8287]

What is the correct form?
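
For reference, the same statement could also be expressed with standard RDF reification, which avoids the RDR syntax entirely (a workaround sketch only; it assumes the usual rdf: prefix, which is not in the dump's prefix list):

_:stmt a rdf:Statement ;
  rdf:subject entity:Q16 ;
  rdf:predicate v:P1451 ;
  rdf:object "A mari usque ad mare"@la ;
  wikibase:Rank wikibase:PreferredRank ;
  q:P805 entity:Q41423 .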

  2. Query performance. This query:
SELECT ?h ?date WHERE {
  ?h wdt:P31 entity:Q5 .
  ?h wdt:P569 ?date .
  FILTER NOT EXISTS {?h wdt:P570 ?d } 
} LIMIT 100

is very slow (minutes) on the 3m-entity dump. The problem seems to be specific to FILTER NOT EXISTS; other filters, such as FILTER (?date < "00000001880-01-01T00:00:00Z"^^xsd:dateTime), work fine.

Is there some issue/configuration/rewrite that would make it fast? I didn't have this problem previously on einsteinium with the full dump, but now the truthy dump is causing trouble.
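
For reference, an equivalent formulation using MINUS instead of FILTER NOT EXISTS, in case the planner treats it differently (a sketch to try, not a confirmed fix):

SELECT ?h ?date WHERE {
  ?h wdt:P31 entity:Q5 .
  ?h wdt:P569 ?date .
  MINUS { ?h wdt:P570 ?d }
} LIMIT 100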

  3. Date types. The DB seems to accept values marked ^^xsd:dateTime fine, but it looks like it treats them as text rather than dates, since filters like FILTER (?date < "1880-01-01"^^xsd:date) or FILTER (?date < "1880"^^xsd:gYear) don't seem to work. We'll probably implement our own date handling eventually, but it would be nice to understand how it works for now. Also, note the two forms of date display:
<http://wikidata-wdq.testme.wmflabs.org/entity/Q2587183>	-3059-01-01T00:00:00.000Z
<http://wikidata-wdq.testme.wmflabs.org/entity/Q3142361>	00000001979-00-00T00:00:00Z

Which is a bit strange.
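
One quick check that might narrow this down is asking the engine what datatype it thinks the values carry (a diagnostic sketch, using the same prefixes as above):

SELECT ?date (DATATYPE(?date) AS ?dt) WHERE {
  ?h wdt:P569 ?date .
} LIMIT 10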

  4. Extensibility. Right now I'm running the server with:
java -server -Xmx4g -jar bigdata-1.5.0.jar

but the configuration mentions a lot of properties that can be configured. How is that done? Also, how would one add functions, extensions, etc. to the jar?

5. Backup. I wonder: if I just copy the .jnl file at a random moment in time and restore it later, is that an OK scenario? What about the HA setup?

Event Timeline


Regarding the executable jar, you can pass the property file with -Dbigdata.propertyFile=<path>

java -server -Xmx4g -Dbigdata.propertyFile=/etc/blazegraph/RWStore.properties -jar bigdata-1.5.0-bundled.jar

It is fine to run this way for development and testing. For production deployment, we currently recommend using the tarball distribution with the startNSS startup scripts.

We are accelerating our Debian distribution, per http://trac.bigdata.com/ticket/979. It will likely be in a 1.5.2 or 1.6 release.

It might be useful to take some of these questions to the bigdata-developers mailing list. Some of these questions already have answers on the wiki.

  1. RDR syntax. For the RDR exception, please create a unit test. There are a few places where that test could be written, for example TestReificationDoneRightEval (the core SPARQL evaluation test suite) or TestRDROperations (test evaluation against the REST end point).
  2. It would be great to get a set of queries for performance testing the wikidata dump. This might be a query plan issue. I can try to run it against a local copy that I've loaded from the following files, but this might be different data. Having a SPARQL end point that is open to the group might be useful.

wget http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/wikidata-terms.nt.gz
wget http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/wikidata-properties.nt.gz
wget http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/wikidata-statements.nt.gz
wget http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/wikidata-simple-statements.nt.gz
wget http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/wikidata-taxonomy.nt.gz
wget http://tools.wmflabs.org/wikidata-exports/rdf/exports/20150126/wikidata-instances.nt.gz

  3. Not sure offhand what is happening here. However, this could be a type-casting issue. xsd:dateTime and xsd:gYear are not the same thing; they would be indexed as different data types, and a key-range scan on one would not intersect with a key-range scan on the other.
  4. We are doing a lot of work on the deployers. The HAJournalServer page of the wiki documents a lot of these options. Try doing "ant stage" and then using either the startNSS or the startHAServices scripts; these share and expose a lot of the supported options. @brad is the point person on the deployers work.
  5. Backup: You cannot obtain a coherent copy of the journal if there are writers executing. If you suspend writes at the application layer, then this works. HA provides online backup; this is documented on the HAJournalServer page of the blazegraph wiki. There are two kinds of online backup: snapshots, which are coherent compressed (gzip) views of the database, and transaction logs (called HALog files), which are per-commit-point logs of the write set of each transaction at the lowest level. Both are fully online and do not block writers or readers. You can also deploy an HA1 mode that does online backup, but the standard NSS does not support this.
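
For the offline case, a minimal sketch of what that looks like (the paths are placeholders; how you quiesce writers depends on your application):

# Stop or quiesce all writers first; a plain file copy is only coherent with no writes in flight.
# The journal path is whatever com.bigdata.journal.AbstractJournal.file points at.
cp /path/to/bigdata.jnl /backups/bigdata-$(date +%Y%m%d).jnl
# Resume writers afterwards. For backups without stopping writers, use the HA snapshot/HALog mechanism.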

What is the correct prefix declaration for "wdt"?

Prefixes used in current format:

@prefix wikibase: <http://www.wikidata.org/ontology-0.0.1#> .
@prefix schema: <http://schema.org/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

@prefix data: <http://wikidata-wdq.testme.wmflabs.org/Special:EntityData/> .
@prefix entity: <http://wikidata-wdq.testme.wmflabs.org/entity/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix wdt: <http://wikidata-wdq.testme.wmflabs.org/entity/assert/> .
@prefix v: <http://wikidata-wdq.testme.wmflabs.org/entity/value/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ref: <http://wikidata-wdq.testme.wmflabs.org/entity/reference/> .
@prefix q: <http://wikidata-wdq.testme.wmflabs.org/entity/qualifier/> .

Unfortunately, the 3m dump is ~2G (and the 10m one is 3.5G; I don't have a full one in this format yet) and I don't have a place to publish such things yet, but if anybody is interested I could probably make a shorter version of it.

Ok. That query

PREFIX wdt: <http://wikidata-wdq.testme.wmflabs.org/entity/assert/>
PREFIX entity: <http://wikidata-wdq.testme.wmflabs.org/entity/> 
SELECT ?h ?date WHERE {
  ?h wdt:P31 entity:Q5 .
  ?h wdt:P569 ?date .
  FILTER NOT EXISTS {?h wdt:P570 ?d } 
} LIMIT 100

runs instantly but does not produce any solutions for me against the data that I have loaded.

We'll have to work against the same data sets for us to really analyze query plans.

@Thompsonbry.systap

I'll try the mailing list, thanks.

  1. Does that imply the error I'm seeing is a bug and not something I'm doing wrong?
  2. Yes, it's different data. We intend to have usable dumps soon (hopefully next week) but don't have them yet. I'll see whether I can put this one somewhere it can be fetched from outside.
  3. I was hoping the data types are convertible between each other, i.e. you can have "1850"^^xsd:gYear and "1880-01-01"^^xsd:date and find both with a query saying "everything before the year 1900"; if that's not the case, I guess we would just use xsd:dateTime everywhere. What worries me more is the two different forms of date display; I suspect it may be treating some of them as strings, but maybe I'm wrong.
  1. I am not sure. That's why I would like to see it in a test case.

Note that the openrdf jars need to appear before the blazegraph jars on the classpath in order for the blazegraph RDF parser not to be replaced by the openrdf parser. If that happens then it will not recognize the RDR constructs. This is an issue that @brad plans to address in the deployers.

  1. I will be out of pocket next week.
  1. Not sure. Again, a unit test could clarify this so I could see the source data file, how the database was configured, and how the database is interpreting the values.
  1. I am running the bundled jar, so there should be no classpath issues.
  1. Current versions of the dumps (short ones, 3m entities) can be seen here: http://wdq-wikidata.testme.wmflabs.org/dumps/

I'll look into how to do the unit tests.

Great! Please put my name on the ticket so I will see it.

@Beebs.systap Unfortunately the executable JAR might not provide a sufficiently strong class ordering guarantee.

This may not be the right ticket, but I did some experimentation with the data sets that I referenced above, looking at parameterization of the load. Using an Intel 2011 Mac Mini with 16GB of RAM and an SSD, the total load time across all datasets was 6 hours, which is roughly 20k triples per second (tps) over 429M triples loaded. The best parameters are below. This configuration used slightly more space on disk (66G vs 60G). It uses a much smaller branching factor for the OSP index and the small-slot optimization on the RWStore to attempt to co-locate the scattered OSP index updates (the updates for this index are always scattered because the inserts are always clustered on the source vertex; this is just how it works out for every application I have seen).

I am using a write cache with 1000 native 1M buffers. You could increase this and probably reduce the IO wait further. I would suggest trying with 2000 buffers and see what impact it has.

You should be able to realize a performance gain by defining some wikidata-specific vocabularies to inline frequently used URIs into 2-3 bytes. This would reduce the average stride on the statement indices, since the predicate (link type) position will typically be 2-3 bytes. It also improves query performance somewhat, since vocabulary items do not require dictionary joins (but we do cache the frequently used terms in the lexicon relation regardless). I generally approach vocabulary definition by simply capturing the frequently used predicates for the domain. However, it is also possible to write a SPARQL query that computes the most common predicates and then feed that into the vocabulary definition process; a sketch of such a query is below.
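
For example, a query along these lines (a sketch; run it against the loaded data and feed the top predicates into the vocabulary definition):

SELECT ?p (COUNT(*) AS ?uses) WHERE {
  ?s ?p ?o .
}
GROUP BY ?p
ORDER BY DESC(?uses)
LIMIT 200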

We can repeat this experimentation again once the new data sets are ready.

#
# Note: These options are applied when the journal and the triple store are
# first created.

##
## Journal options.
##

# The backing file. This contains all your data.  You want to put this someplace
# safe.  The default locator will wind up in the directory from which you start
# your servlet container.
com.bigdata.journal.AbstractJournal.file=bigdata.jnl

# The persistence engine.  Use 'Disk' for the WORM or 'DiskRW' for the RWStore.
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW

# Setup for the RWStore recycler rather than session protection.
com.bigdata.service.AbstractTransactionService.minReleaseAge=1

# Enable group commit. See http://wiki.blazegraph.com/wiki/index.php/GroupCommit
# Note: Group commit is a beta feature in BlazeGraph release 1.5.1.
#com.bigdata.journal.Journal.groupCommit=true

com.bigdata.btree.writeRetentionQueue.capacity=4000
com.bigdata.btree.BTree.branchingFactor=128

# 200M initial extent.
com.bigdata.journal.AbstractJournal.initialExtent=209715200
com.bigdata.journal.AbstractJournal.maximumExtent=209715200

# Create namespace (triples+RDR, no-inference, no text index)
com.bigdata.rdf.sail.truthMaintenance=false
com.bigdata.rdf.store.AbstractTripleStore.quads=false
com.bigdata.rdf.store.AbstractTripleStore.statementIdentifiers=true
com.bigdata.rdf.store.AbstractTripleStore.textIndex=false
com.bigdata.rdf.store.AbstractTripleStore.axiomsClass=com.bigdata.rdf.axioms.NoAxioms
# FIXME DEFINE AND USE WIKI DATA VOCABULARY CLASS
# Bump up the branching factor for the lexicon indices on the default kb.
com.bigdata.namespace.kb.lex.com.bigdata.btree.BTree.branchingFactor=400
com.bigdata.namespace.kb.lex.ID2TERM.com.bigdata.btree.BTree.branchingFactor=800
com.bigdata.namespace.kb.lex.TERM2ID.com.bigdata.btree.BTree.branchingFactor=128
# Bump up the branching factor for the statement indices on the default kb.
com.bigdata.namespace.kb.spo.com.bigdata.btree.BTree.branchingFactor=1024
com.bigdata.namespace.kb.spo.OSP.com.bigdata.btree.BTree.branchingFactor=64
com.bigdata.namespace.kb.spo.SPO.com.bigdata.btree.BTree.branchingFactor=600
# larger statement buffer capacity for bulk loading.
com.bigdata.rdf.sail.bufferCapacity=100000
# Override the #of write cache buffers to improve bulk load performance. Requires enough native heap!
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=1000

# Enable small slot optimization!
com.bigdata.rwstore.RWStore.smallSlotType=1024

Thanks,
Bryan

Change 208029 had a related patch set uploaded (by Smalyshev):
Set up default settings as suggested in T92308

https://gerrit.wikimedia.org/r/208029

@Thompsonbry.systap I've added your recommended settings to our default config. One question: I see com.bigdata.namespace.kb, so I imagine this is a per-namespace setting and if we use a different namespace we should change it? Can we set it for all namespaces?

Yes, but. There are global defaults and you can set them. For example, if you list out the namespace properties you will see how the 128 default is set. The issue is that we are using different branching factors for the spo and lex relations, and even within those for the different indices, and the overrides apply to a prefix.

I've been toying with the thought of changing this to a regex. I have not looked at what the impact of that would be yet, and there could be conflicting overrides with regexes, but it might solve some problems.

So the short answer is: not really, because the prefix does not work out appropriately.
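
To make the prefix pattern concrete, here is a sketch reusing values from the properties file above, with a hypothetical namespace name "wdq" in place of the default "kb":

# Global default branching factor for all B+Tree indices:
com.bigdata.btree.BTree.branchingFactor=128
# Override for the lexicon indices of the "wdq" namespace:
com.bigdata.namespace.wdq.lex.com.bigdata.btree.BTree.branchingFactor=400
# Override for a single statement index (OSP) of the "wdq" namespace:
com.bigdata.namespace.wdq.spo.OSP.com.bigdata.btree.BTree.branchingFactor=64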

Bryan

Change 208029 merged by jenkins-bot:
Set up default settings as suggested in T92308

https://gerrit.wikimedia.org/r/208029

@Thompsonbry.systap I've got this message:

WARN : AbstractBTree.java:2135: Bloom filter disabled - maximum error rate would be exceeded: entryCount=1883228, factory=BloomFilterFactory{ n=1000000, p=0.02, maxP=0.15, maxN=1883227}

Anything to worry about or settings to change, or is this normal?

That is normal. You can choose to explicitly disable bloom filters in advance. Otherwise they are disabled once their expected error rate would be too high. Nothing to be concerned about.
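
If you would rather not see the warning at all, the bloom filter can be turned off when the KB is created; the option should be the one below (worth double-checking against AbstractTripleStore.Options):

# Disable the bloom filter on the statement indices at KB creation time.
com.bigdata.rdf.store.AbstractTripleStore.bloomFilter=false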

Bryan

I think we're done with this now.