
BlazeGraph Finalization: Machine Sizing/Shaping
Closed, Resolved · Public

Description

We need to figure out how the machines should be shaped. I imagine it's similar to Elasticsearch, actually: CPU is kind of important, RAM is super important, smallish but fast SSDs. But we need to investigate.

Event Timeline

Manybubbles assigned this task to Smalyshev.
Manybubbles raised the priority of this task from to Medium.
Manybubbles updated the task description.

Running it with 16g of RAM seems to behave fine, though it takes 500% CPU (on a 16-CPU machine) even when running a single query. Simpler queries behave within reason, but this one, for example:

prefix data: <http://www.wikidata.org/entity/>
SELECT ?p WHERE {
	?p (data:P31s/data:P31v)+ data:Q28640 .
}

This should probably get better once we have direct paths combining data:P31s/data:P31v into one, but right now this query has been running for 5 minutes with no sign of stopping. It looks like unbounded traversals are not very efficient.
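
For illustration, if the statement/value pair were collapsed into a single direct predicate (hypothetically called data:P31d here; no such predicate exists yet), the same query would become a one-predicate path:

prefix data: <http://www.wikidata.org/entity/>
SELECT ?p WHERE {
	# data:P31d is a hypothetical combined "direct" predicate, for illustration only
	?p data:P31d+ data:Q28640 .
}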

Also, label lookups seem to be pretty expensive on top of a regular query, which may be no surprise, since a label is just another triple, and on top of that there are a lot of them. I wonder if having a separate service to look up labels would be useful?
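
Concretely, a label lookup is just one more triple pattern per projected entity; a minimal sketch (assuming labels are stored as rdfs:label triples, as in the Wikidata RDF export, and restricting to English for the example):

prefix data: <http://www.wikidata.org/entity/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?p ?pLabel WHERE {
	?p (data:P31s/data:P31v)+ data:Q28640 .
	# the label join adds one more scan plus value materialization per result
	?p rdfs:label ?pLabel .
	FILTER(lang(?pLabel) = "en")
}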

There are two known issues with the property path operator that Nik and I discussed today. These are [1] and [2]. The first of these issues means that the operator is actually fully materializing the property paths *before* giving you any results. [1] is a ticket to change that and do incremental eviction of the results. That will fix the "liveness" issue you are seeing with that path expression. The other ticket deals with another problem where the property path operator can become CPU bound if the necessary access paths are in cache, since there is no IO Wait. However, I suspect that [2] will disappear when we address [1]. As for the timing on this, we can elevate this to a critical issue and get it done in our next sprint for the 1.5.1 release.

Nik and I also discussed the inference vs. materialization question as it relates to property paths. Out of all the property hierarchies, only the geo-political one involves transitive traversal up more than a single predicate. I suspect that (with the fix to [1]) you can easily use property paths for all such expressions.

The draft of the SPARQL property path support included a mechanism to bind the actual path onto variables. Unfortunately this did not make it into the final specification of property paths. Using property paths you can specify the path as a regular expression of predicates to be traversed, but you cannot extract the actual vertices along that path. However, the blazegraph property path operator internally *does* support this, along with minimum and maximum traversal limits. The ALP (Arbitrary Length Path) Service [3] exposes all of this using an ASTOptimizer. See ASTALPServiceOptimizer in the 1.5.0 release. The documentation is pretty sparse; I will fix that and also add a page on our wiki for this. See [4].
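
As a rough sketch only: the service URI (bd:alp) and parameter names below are placeholders assumed for illustration, not the documented API, so check ASTALPServiceOptimizer and [3]/[4] for the real syntax. A bounded-depth ALP-style service call could look something like:

prefix data: <http://www.wikidata.org/entity/>
prefix bd: <http://www.bigdata.com/rdf#>
SELECT ?p WHERE {
	# hypothetical service URI and parameter names, for illustration only
	SERVICE bd:alp {
		?p (data:P31s/data:P31v) data:Q28640 .
		bd:serviceParam bd:lowerBound 1 .
		bd:serviceParam bd:upperBound 10 .
	}
}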

Thanks,
Bryan

[1] http://trac.bigdata.com/ticket/1003 (Property path operator should output solutions incrementally)
[2] http://trac.bigdata.com/ticket/994 (Property Path operator can become CPU bound)
[3] http://trac.bigdata.com/ticket/1072 (Configurable ALP Service)
[4] http://trac.bigdata.com/ticket/1117 (New ticket to document the ALP Service)

By label lookup I assume that you mean materializing (projecting out through a SELECT expression) the actual RDF Values for URIs or Literals that have become bound by the query. In general, join processing can proceed without dictionary materialization and that materialization step is deferred until the variable bindings are projected out. At that point they require scattered IOs against the reverse dictionary index (ID2TERM). This does incur overhead.
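
As a small illustration of that deferral: a query that only projects an aggregate should not need to materialize the bound RDF Values at all, whereas projecting the URIs themselves forces the ID2TERM lookups. A sketch, reusing the data: prefix from above:

prefix data: <http://www.wikidata.org/entity/>
# Joins run over internal term identifiers; since only the count is projected,
# no reverse-dictionary (ID2TERM) materialization should be required.
SELECT (COUNT(?p) AS ?n) WHERE {
	?p data:P31s/data:P31v data:Q28640 .
}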

There are options to inline the full serialization of URIs and Literals. However, this decreases the effective stride of the indices and has a negative impact on join performance. So these things get traded off: join performance on the statement indices versus the cost of RDF Value materialization when projecting out solutions.

Those scatter lookups are much faster against SSD.

Concerning threads:

See [1] for a general background on query optimization for blazegraph. The QueryEngine class in blazegraph supports parallelism across queries (concurrent non-blocking queries), within queries (different operators in the same query will execute concurrently), and within operators (most operators can be instantiated multiple times if there is enough work in their input queue and those instances will run concurrently when this happens).

At a high level, blazegraph is using threads to schedule IOs. A typical query will have a bunch of PipelineJoins. Those joins will each have their own thread and those threads will be scheduling IOs. (We have looked at async IO quite a bit and concluded that it did not appear to offer a clear win in performance while making the code patterns more complex.)

BlazeGraph actually does very well with heavy concurrent client workloads. For example, the commercial OwlIM (now GraphDB) platform is close to blazegraph with 4 clients on a concurrent workload mixture but we completely dominate them with 64 clients each running the same mixture.

A lot of the GC overhead comes from the intermediate solutions waiting in queues in front of each operator in the query engine. One way to improve single-client performance is to transparently migrate those intermediate solution queues onto native memory, just as we do with the analytic query mode. This would also allow us to increase the vector size, and that would improve performance when there are only a few clients.
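
For reference, the analytic query mode mentioned above can be requested per query with a blazegraph query hint; a minimal sketch, assuming the hint: namespace and the hint:analytic hint as documented on the blazegraph wiki:

prefix data: <http://www.wikidata.org/entity/>
prefix hint: <http://www.bigdata.com/queryHints#>
SELECT ?p WHERE {
	# ask the query engine to use the native (C process) heap for this query
	hint:Query hint:analytic "true" .
	?p (data:P31s/data:P31v)+ data:Q28640 .
}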

Thanks,
Bryan

[1] http://wiki.bigdata.com/wiki/index.php/QueryOptimization

In terms of the machine shape, the general guidelines you give are appropriate. However, here is how it plays out in terms of GC. Large heaps => long GC pauses. So you want to keep the JVM heap fairly small (4G => 8G). Analytic queries can use the native C process heap for hash index joins and (in the future) for storing intermediate solutions, so the actual C process heap (for the JVM) can be bigger. If you are bulk loading data then you want more write cache buffers. Those are 1MB buffers; you can have anywhere from 6 to 1000s. This also helps for bulk load onto disks that cannot reorder writes (SATA).

The rest of that RAM is going to buffer the file system and decrease IO Wait.

Some of our customers also use warmup procedures to avoid cold start performance problems. There are a couple of aspects to the cold start issue. One is just that things are slow because they are on the disk. Another is that the JVM has not yet optimized the code. Yet another impact is that the data has a longer dwell time during query execution because it takes longer to execute the query. This makes the GC overhead higher for cold disk / cold JVM scenarios.

One warmup procedure is just to copy the journal file to /dev/null, simply to get it into the OS cache.

Another is to run http://.../bigdata/status?dumpJournal&dumpPages=true. This will run through all of the indices, visit all of their pages, and provide some interesting reporting. We have been discussing a warmup procedure based on this which only visits the non-leaf nodes of the indices. After that warmup, reading any leaf would just be a single IO. That should eliminate most of the IO Wait and GC burden associated with slamming a cold node.

And if you are load balancing across nodes, then you can obviously just load balance based on metrics and gradually shift more load to a node as it heats up.

Thanks,
Bryan

@Thompsonbry.systap I was thinking about maybe separating the statement data and the label/description data into different bigdata instances, since for most searches label data is not needed: we will almost never query against labels (ElasticSearch probably does a much better job in that regard), and they are needed only when presenting results for human consumption. The query would return a set of IDs (in cases where entity IDs are needed rather than literal values, counts, etc.), and then another lookup would provide the labels if necessary. So having labels, descriptions, and statements in the same DB may be counterproductive. But I'm not sure if this is a good idea or not.

Out of all the property hierarchies, only the geo-political one involves transitive traversal up more than a single predicate

We probably have more hierarchies than that. Some examples: professions, biological taxa, genealogy. So we'd need working traversals of unknown length.
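
For example, a profession hierarchy walk would look just like the geo-political one, only over a different property; a sketch using P279 ("subclass of") in the same statement/value encoding as the query above:

prefix data: <http://www.wikidata.org/entity/>
SELECT ?class WHERE {
	# unknown-depth walk up the subclass hierarchy, starting from profession (Q28640)
	data:Q28640 (data:P279s/data:P279v)* ?class .
}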

We've assigned the property path optimization and will focus on it in our next sprint. In fact, we hope to get started on this late this week.

I'm happy enough with what I know now for this not to block the decision. Generally we know we'll want SSDs and RAM, probably similar in shape/size to the Elasticsearch boxes, which are big boxes from our perspective: 128 GB RAM, 2 nice SSDs, 12 logical cores.