
Evaluate the Apache Jena Framework
Closed, Resolved (Public)

Description

Apache Jena provides SPARQL 1.1, SHACL (core and SPARQL), ShEx, and RDF-star.

Using the TDB2 storage component:

Jena has been reported to load Wikidata (20211222_latest-all.nt.gz) (16.7 billion triples at 44.8k triples/second in 103h 45m 15s).

Jena has been reported to load Wikidata Truthy (2021-12) (6.6 billion triples in 40 hours, at 46k
triples/second).
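The quoted rates can be cross-checked against the triple counts and durations given above. The following is a minimal sketch in Python; the function name is illustrative, and the inputs are the rounded figures from the reports, so small differences from the quoted rates are expected.

```python
def triples_per_second(triples: float, hours: int, minutes: int = 0, seconds: int = 0) -> float:
    """Average load rate from a triple count and a wall-clock duration."""
    elapsed_seconds = hours * 3600 + minutes * 60 + seconds
    return triples / elapsed_seconds


# Wikidata "all": 16.7 billion triples in 103h 45m 15s (figures from the report above)
all_rate = triples_per_second(16.7e9, 103, 45, 15)

# Wikidata "truthy": 6.6 billion triples in 40 hours (figures from the report above)
truthy_rate = triples_per_second(6.6e9, 40)

print(f"all:    {all_rate / 1000:.1f}k triples/second")
print(f"truthy: {truthy_rate / 1000:.1f}k triples/second")
```

With these rounded inputs the results come out near 44.7k and 45.8k triples/second, in line with the reported 44.8k and 46k.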

Jena has various extension mechanisms, including
overloading the SERVICE keyword.

Jena might provide a way for users to load the data, or a focused subset of the
data, for local use thereby potentially offloading the central SPARQL service.

Event Timeline

dcausse renamed this task from Evaluate Apache Jena to Evaluate Apache Jena TDB2. Jan 21 2022, 10:47 AM
dcausse updated the task description.
AndySeaborne renamed this task from Evaluate Apache Jena TDB2 to Evaluate Apache Jena. Jan 21 2022, 11:25 AM
AndySeaborne updated the task description.

Hi @dcausse - TDB2 on its own doesn't provide SPARQL or any of the other features. TDB2 is just one low level storage choice - it's not a standalone thing and on its own can't provide any offloading from the central service. I hope you find the description clearer now.

The project has had other 3rd party organisations presenting their own naming and architecture descriptions of Jena so I just wanted to be clear here.

The choices for WD are interesting and non-trivial. If there is anything else I can help with - please do ask.

@AndySeaborne -- It has been my understanding that Apache Jena (the framework) performs differently (in the speed of various actions, in its limitations, and possibly in its feature list) when the active "low level storage choice" is changed, for example from TDB to TDB2 to Virtuoso, or to any other engine that offers a Data Provider for Jena.

If my past understanding remains correct, I think the title of this task would be appropriately changed to Evaluate Apache Jena with TDB2, and that there ought to be some parallel tasks created with titles adjusted to include the "low level storage choice" made for that task, one of which should be Evaluate Apache Jena with Virtuoso in that role.

There may be other variables which may make sense within a single evaluation task, and others which may make more sense as a distinct evaluation task.

If you think it only makes sense for an omnibus task to Evaluate Apache Jena, then I submit that there should at least be multiple subtasks with the "low level storage choice" variability I've described above.

All - I'm sorry that this sub-task is being redirected to be about Virtuoso. This would be better moved to the Virtuoso task.

Apache Jena releases a single software product. TDB is the only persistence layer for Apache Jena that comes from the Apache Jena project.

The link @TallTed gives is to Virtuoso-specific documentation. The software does not come from the Apache Jena project.

TDB is the _internal_ name for the B+Tree storage component. TDB2 is the current generation of that component.

SPARQL is an important aspect for Wikidata, and the first code example shows Apache Jena SPARQL execution being bypassed, with only thin use of the Java API.


@TallTed: The examples on the page do not describe Virtuoso used as a "low level storage choice" for SPARQL execution; they show a complete bypass of Jena.

It should be on the task for evaluating Virtuoso because it is what OpenLink is providing. It is 5% Jena (API layer) and 95% Virtuoso. All performance and data scale characteristics are down to Virtuoso.

I do not understand why Wikidata usage would want to bypass the Virtuoso triplestore HTTP interface, but if you want that considered, it would be better as part of the Virtuoso evaluation. It can be compared to the same approach with other code APIs accessing Virtuoso.

The diagram might, at best, be said to relate to the design of the research prototype Jena1 (over 15 years ago), many years before Jena became Apache Jena. SPARQL didn't exist for that architecture, which predates the W3C work on SPARQL.

  • SPARQL evaluation does not go through the Model API.
  • Apache Jena does not provide storage in SQL databases anymore.
  • TDB does not store models.
  • TDB isn't even mentioned on the diagram.

The page you link to talks about Jena 2.6, which is not an Apache release, and Jena 2.10.0, which dates from 2013-02-24, during the transition to Apache Jena.

Virtuoso can provide fine-grained access with VirtGraph but that is not how TDB fits into Jena.
Using VirtGraph might get Virtuoso users SHACL/ShEx support (RDF-star will not work), but that isn't the focus for Wikidata as I understand it.

If you want to discuss the general integration of Virtuoso and Jena, then let's take that to the Jena mailing lists.

I have to agree with @AndySeaborne - talking about "Apache Jena with TDB2" makes as much sense as talking about "VW Beetle with an internal combustion engine". The framing makes it sound like Beetles come with all kinds of engines, though in reality they've all been equipped with an ICE at the factory so far. There are electric conversion kits for hobbyists etc. but that's a really marginal thing and needs its own discussion.

Similarly, TDB(2) is an integral component of Apache Jena - by far the most common setup and the only one supported by the Apache Jena project. It would be possible to compare TDB1 vs. TDB2, but those are just iterations of the same storage technology, TDB1 is on the way out, and any new evaluations should be made with TDB2.

Disclaimer: I'm a developer (committer & PMC member) for Apache Jena as well as a contributor to Wikidata (esp. mappings to controlled vocabularies such as YSO and GACS).

Sorry for the confusion that the rename I did of this task caused.
Just to bring clarity on my reasoning as a maintainer of the wikidata query service stack as to why being specific on TDB2 might be helpful:

  • Some components of Jena are already being used (i.e. the SPARQL parser for query analysis)
  • Jena was considered in 2015 but declined, ref: T90112 (sadly no reasons were given)

This task is, I think, about evaluating Jena and its storage component as a storage/query engine for Wikidata Query Service, but that does not mean that everything Jena offers will be discarded if this task is declined.

I read the whole thread and just want to point out that Jena also supports SPARQL Update.

From what I can see, it seems to be able to replace Blazegraph. But it does not solve the issue of having multiple parallel servers all with their own snapshot of the current WD triples.

Maybe it is currently not possible to avoid that, but it would be nice to have all the triples in ONE place and serve them from multiple servers that handle the SPARQL requests.

@So9q : How would you like to serve everything from one place? It is normal to have replicas of the data. One of the big bottlenecks is IO. Or do I understand something wrong?

MPhamWMF triaged this task as Medium priority. Jan 24 2022, 4:28 PM
MPhamWMF moved this task from Incoming to Scaling on the Wikidata-Query-Service board.

Hi @AndySeaborne, what are the latest benchmarks for loading Wikidata all and truthy with the Jena 4.4.0 release and the new TDB2 xloader with the "--threads" argument? I noticed the release notes said this:

== Improved bulk loader

This release includes the version of the TDB2 xloader for very large
datasets.

It has been used to load 16.6B triples (WikiData all) into TDB2 and
loading truthy (6B) on modest hardware. Thanks to Marco, Lorenz and
Øyvind for running Wikidata load trials.

The loader now has "--threads=", which has been reported to give improved
load times (if the server has the hardware!).

And what was that modest hardware? A link to the details of their trials would be great.

You add Fuseki to Jena to get a SPARQL endpoint. Jena + Fuseki is reasonable to investigate as a Blazegraph Alternative.
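As a rough illustration of what "a SPARQL endpoint" means for a client, here is a minimal Python sketch that builds a SPARQL 1.1 Protocol query request using only the standard library. The endpoint URL (host, port 3030 and the dataset name "ds") and the helper name are assumptions for illustration, not details taken from this task; the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL Protocol 'query via GET' request asking for JSON results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "application/sparql-results+json"})


# Assumed local endpoint; adjust to wherever the server actually runs.
ENDPOINT = "http://localhost:3030/ds/sparql"
QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"

request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
```

Once a server is running at that address, passing `request` to `urllib.request.urlopen` would execute the query; any SPARQL Protocol endpoint accepts the same request shape, which makes it easy to point the same client at different candidates.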

@Thadguidry -

https://lists.apache.org/thread/vso02pwg4z6qcs3r1h0mcbc86ls74bhm

where --parallel (the argument on sort(1) that is set by --threads) was set to 16.

It took 31h compared to 39h without --parallel on sort(1).
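For scale, the two timings above can be turned into a relative saving. This is a small Python sketch using only the 31h and 39h figures quoted here; the function name is illustrative.

```python
def relative_reduction(baseline_hours: float, improved_hours: float) -> float:
    """Fraction of the baseline run time that was saved."""
    return (baseline_hours - improved_hours) / baseline_hours


baseline = 39.0       # hours without --parallel on sort(1)
with_parallel = 31.0  # hours with --parallel set to 16

saved_hours = baseline - with_parallel
reduction = relative_reduction(baseline, with_parallel)
speedup = baseline / with_parallel

print(f"saved {saved_hours:.0f}h, {reduction:.1%} less time, {speedup:.2f}x faster")
```

That works out to 8 hours saved, about a fifth of the original load time.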

@AWesterinen - Fuseki is part of Jena. Most of the subsystems have informal names. People refer to "Jena" or "Fuseki" interchangeably and the context is the task they are doing. Being more specific on naming didn't catch on.

@AndySeaborne Agree. I was erring on the side of explaining where the SPARQL endpoint came from (not Jena TDB).

So9q renamed this task from Evaluate Apache Jena to Evaluate the Apache Jena Framework. Feb 8 2022, 9:28 AM

@So9q : How would you like to serve everything from one place? It is normal to have replicas of the data. One of the big bottlenecks is IO. Or do I understand something wrong?

I actually do not have a lot of experience with big data and IO bottlenecks, so you might be right.

With RDF Delta and Jena it might be feasible to continue with a cluster of servers each getting RDF Delta updates and serving requests?

But Wikidata is perhaps/probably going to outgrow what a single server can handle within 5 years, so it might be worth testing now with 10x (or 50x) the current amount of triples/items*, so we make sure the next system can handle the sheer amount of data.

We still have not modeled more than perhaps 5% of the knowledge in Wikipedia. Wikidata can grow a lot when Abstract Wikipedia picks up momentum, when all science papers are imported (50M missing and a lot of citation triples missing, a lot of MeSH and similar classifications are missing), when all chemicals are imported (100M missing according to @EgonWillighagen), etc.

* The current size is wikidata-20220103-all.json.gz (109.04 GiB) => Test all candidates with 1000 GB of RDF data (10x).
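To get a feel for what 10x or 50x would mean for bulk loading alone, here is a back-of-the-envelope Python sketch. It takes the triple count and load rate reported in the task description (16.7 billion triples, 44.8k triples/second) and assumes, optimistically, that the rate stays constant as the dataset grows; that assumption and the function name are illustrative only, not measured behaviour.

```python
CURRENT_TRIPLES = 16.7e9  # Wikidata "all", from the task description
REPORTED_RATE = 44.8e3    # triples/second, from the task description


def estimated_load_hours(triples: float, rate: float) -> float:
    """Load time in hours, ASSUMING the rate does not degrade with size."""
    return triples / rate / 3600


for factor in (1, 10, 50):
    hours = estimated_load_hours(CURRENT_TRIPLES * factor, REPORTED_RATE)
    print(f"{factor:>2}x: {hours:8.0f} hours (about {hours / 24:.0f} days)")
```

Even under this flat-rate assumption, a 10x dataset comes out at roughly 43 days for a single load, which suggests load time is one of the things such a test would need to measure.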

There are multiple aspects to a solution including making better use of federation. So, I would not rule out a single server but indeed, it is not feasible without making additional changes. Please see Blazegraph Candidate Alternatives for some (work-in-progress) thoughts.

MPhamWMF raised the priority of this task from Medium to High. Mar 21 2022, 2:46 PM
MPhamWMF lowered the priority of this task from High to Medium.
MPhamWMF raised the priority of this task from Medium to High. Mar 29 2022, 1:33 PM