
Evaluate Titan as graph storage/query engine for Wikidata Query service
Closed, Resolved · Public

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to Needs Triage.
Smalyshev updated the task description.
Smalyshev changed Security from none to None.
Smalyshev added subscribers: Smalyshev, Manybubbles.
Smalyshev added a subscriber: GWicke.

Open questions:

  1. How to handle qualifiers? For example, qualifiers such as point-in-time and start-date/end-date. Most queries we'd run would probably be interested in current values, but some may need past values.
    1. With properties, it's easier because we can just stuff everything into a vertex, but we need some standard format for storing non-current data.
    2. With edges, we have the potential to pollute the graph with a myriad of data that we would have to filter out on each query. We may distinguish them by edge properties, but that could complicate many queries. Alternatively, we could make special "past" edges - named with a prefix, etc. - so that queries interested in non-current data could address them explicitly.
    3. Any such processing would require ad-hoc handling of specific properties, such as point-in-time and start/end, to identify which data is current and which is not. We may need to be careful with this, since handling it wrongly on the initial import would require a re-scan of the whole data set.
  2. Right now we completely ignore references. Do we want to keep them in the index too?
  3. Which fields do we want to search against - i.e., do we want to search against descriptions (we'd probably need a full-text index, like Elastic, to make it useful for anything)?

Technical issues:

  1. On import, Titan sometimes slows down and gets into GC loops.
  2. On querying, for vertices with a lot of edges (such as wd("Q5").in("P31"), i.e. "humans"), Titan produces a backend exception:
Caused by: org.apache.thrift.transport.TTransportException: Frame size (17555240) larger than max length (16384000)!

Example query:

g.wd('Q5').in('P31').labelEn[0]

However, g.wd('Q5').in('P31')[0].labelEn works.

Technical issues:

  1. On querying, for vertices with a lot of edges (such as wd("Q5").in("P31"), i.e. "humans"), Titan produces a backend exception:
> Caused by: org.apache.thrift.transport.TTransportException: Frame size (17555240) larger than max length (16384000)!
>

Possibly related: http://stackoverflow.com/questions/23055507/java-sql-sqlnontransientconnectionexception-org-apache-thrift-transport-ttransp

Bumping up thrift_framed_transport_size_in_mb might help.

Generally, the thrift interface is the old & low-level way to interact with Cassandra. https://github.com/thinkaurelius/titan/issues/312 discusses switching to native transport, but it seems that it isn't merged yet. Native transport has some nifty features like autopaging / streaming.

Another option that might help is https://github.com/thinkaurelius/titan/wiki/Using-Cassandra#titan-embedded-mode, as that cuts out thrift altogether by running cassandra in-process.

@GWicke I'm not sure cassandra in-process is the way to go, because eventually I assume we want to run it as a cluster, which may be harder to manage if we link them together. Also, the docs say:

Note, that running Titan with Cassandra embedded requires GC tuning. While embedded Cassandra can provide lower latency query answering, its GC behavior under load is less predictable.

Which is a bit scary to me :) But I'll try it out and update here.

OK, setting thrift_framed_transport_size_in_mb to 256 in cassandra.yaml in both Cassandra and Titan (they both have the yaml file) seems to eliminate the Frame size error; now g.wd('Q5').in('P31').labelEn[0] works and produces 'Douglas Adams' as it should.
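For reference, the relevant line in both cassandra.yaml files now reads:

thrift_framed_transport_size_in_mb: 256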

So current technical issues seem to be resolved.

Note, that running Titan with Cassandra embedded requires GC tuning. While embedded Cassandra can provide lower latency query answering, its GC behavior under load is less predictable.

Yeah, agreed. GC scaling limits are kind of the Achilles' heel of JVM apps. And most other GC'ed languages with less refined GCs do even worse.

So current technical issues seem to be resolved.

Cool!

  1. Right now we completely ignore references. Do we want to keep them in the index too?

Yes, although the other parts of statements are more important. One might query for things that are not sourced and/or only have a source in Wikipedia, or one might want to find things that refer to a particular source.

  1. Which fields do we want to search against - i.e., do we want to search against descriptions (we'd probably need a full-text index, like Elastic, to make it useful for anything)?

We probably want to search against everything in statements and site links. We might not need to import labels, descriptions and aliases until we want to have them directly in the output of the search instead of getting them from mysql/memcache afterwards.
Full text is not necessary at first: label, description and alias full-text search is, I think, the least important feature, as our current Elastic setup should already cover these (although not as separate fields). I'm sure someone could come up with queries they want that would need this, though. More interesting are probably the values of monolingual text, but even those don't absolutely require a full-text index. None of the example queries we have collected so far need a full-text index. For all the other string values, full text would probably not be used.
Though in the future it might be beneficial to be able to meet our searching needs through one endpoint (like Titan) even for normal full-text queries, which would allow us to easily add drill-down or faceted search features based on graph queries.

  1. How to handle qualifiers? For example, qualifiers such as point-in-time and start-date/end-date. Most queries we'd run would probably be interested in current values, but some may need past values.

For queries not explicitly asking for something besides the best current consensus, ranks are used to select the relevant statements. I.e. every time one selects statements, use only those of rank preferred if any of that rank exist for that property, otherwise those of rank normal. Statements ranked as deprecated should be ignored unless specified otherwise in the query. See https://www.wikidata.org/wiki/Help:Ranking .

But yes, for a query like "who was the mayor of Berlin in 1984" you would need to look at statements that 1) are ranked preferred or normal, 2) have a start-date qualifier that is <= 1984, and 3) either do not have an end-date qualifier or have an end-date >= 1984.
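A rough Java sketch of that filter (illustrative only - the field names and the Integer-year simplification are assumptions, not the actual data model):

class QualifierFilter {
    // year: the year asked about, e.g. 1984
    // startYear/endYear: values of the start-date/end-date qualifiers, or null if absent
    static boolean validIn(int year, boolean deprecated, Integer startYear, Integer endYear) {
        if (deprecated) return false;                             // 1) only preferred or normal ranks
        if (startYear == null || startYear > year) return false;  // 2) start-date qualifier must be <= year
        return endYear == null || endYear >= year;                // 3) no end-date, or end-date >= year
    }
}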

  1. With properties, it's easier because we can just stuff everything into a vertex, but we need some standard format for storing non-current data.
  2. With edges, we have the potential to pollute the graph with a myriad of data that we would have to filter out on each query. We may distinguish them by edge properties, but that could complicate many queries. Alternatively, we could make special "past" edges - named with a prefix, etc. - so that queries interested in non-current data could address them explicitly.

All this should be decided based on ranks, not qualifiers. However, if you evaluate on import which statements (based on rank) are to be considered by default, then in order to still answer questions like the historic one above, you need a way to specifically query for preferred/normal/deprecated statements, ignoring that default.

  1. Any such processing would require ad-hoc handling of specific properties, such as point-in-time and start/end, to identify which data is current and which is not. We may need to be careful with this, since handling it wrongly on the initial import would require a re-scan of the whole data set.

No, as above. Queries not explicitly asking for something besides the best current consensus do not need to understand qualifiers, only ranks. That should remove the possibility of incorrect interpretation during import.

@JanZerebecki The issue here is that many items do not have any statement marked as preferred. Take https://www.wikidata.org/wiki/Q30, for example.

What is the population of the USA? We don't have any number marked as preferred. We either have to report that we have no idea about the US population, or take the latest value by point in time.

Now, if we go to "contains administrative territorial entity", "Territory of Alaska" does not have a "deprecated" mark, but it does have an end date in 1959, so if we are asked "what are the sub-entities of the USA" we probably should not return it unless we're specifically asked about 1958. So again, we'd need to look into qualifiers.

Now, let's consider https://www.wikidata.org/wiki/Q76. Where does Barack Obama reside? We have two entries - "Capitol Hill" and "the White House" - neither of them preferred, but only one of them current.

But here we have a more complicated case, because if we look at "awards received", we may want all entries listed, not just the latest one. So how do we distinguish that from the US population case, where we do want only the latest? Can we do it automatically, or would a human have to write separate queries for these two cases?

Now, a more general question. If we have one preferred entry and one that is not, do we return only the preferred one? What if there are no preferred ones? I.e., in "ethnic group" for Obama we have "African American" and "Irish American", with the former preferred - do we return just that when asked for Obama's ethnic group? But for "surname" we don't have anything preferred - we can't return nothing there, right? We could opt for "return preferred if it exists, otherwise all", but that means slowing things down, as we'd have to scan all the data just to find out whether there is anything preferred - and we'd have to do it at each step of the search, since a match on a non-preferred value then wouldn't count, I assume. And we still have cases like "US population" where getting all of them is no good - i.e. if we ask "what are the 10 most populous countries" we do want to compare the latest figures, right?

afaik, i think our plan for simple queries is to consider "best statements".

see https://github.com/wmde/WikibaseDataModel/blob/master/src/Statement/BestStatementsFinder.php for how that is determined.

this means we:

  1. do not include deprecated statements
  2. include only preferred statements, if there are any, within a property's statement list
  3. if there are no preferred statements, then include normal-rank statements

i would take the same approach here.
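In other words, roughly (a minimal Java sketch; the Statement and Rank types here are illustrative stand-ins, not the actual Wikibase classes):

import java.util.ArrayList;
import java.util.List;

enum Rank { PREFERRED, NORMAL, DEPRECATED }

class Statement {
    Rank rank;
    Statement(Rank rank) { this.rank = rank; }
}

class BestStatements {
    // Deprecated statements are never included; preferred ones win over normal ones.
    static List<Statement> select(List<Statement> statementsOfOneProperty) {
        List<Statement> preferred = new ArrayList<>();
        List<Statement> normal = new ArrayList<>();
        for (Statement s : statementsOfOneProperty) {
            if (s.rank == Rank.PREFERRED) preferred.add(s);
            else if (s.rank == Rank.NORMAL) normal.add(s);
        }
        return preferred.isEmpty() ? normal : preferred;
    }
}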

@aude so how do we write a query for "what is the population of the USA"? None of the population values are preferred, and if we don't consider qualifiers we have a dozen values which we don't know how to distinguish.

then index the normal-ranked statement if it exists. if there are multiple populations that are all normal, then i think you get multiple answers.

maybe if we do consider qualifiers, we can be somewhat smarter, like seeing if they have dates and picking the most current. or better yet, leave that to the user of the query system to decide how to filter further if desired.

@aude we can leave it to the user, but then a simple query like "give me the ten most populous countries" would be complicated, as each population value would be an unfiltered multi-value that needs to be manually processed each time by writing code to filter it. How is it solved now - i.e., how does the "ten most populous countries" query work? Does it require the user to sort through population values manually, or does it automatically return the latest unless you ask for a different one?

Do not worry about the current usage of ranks. As soon as their use becomes more meaningful, through queries for example, they will be used more.

So we have a chicken-and-egg problem here. Should we code for data that is ranked properly (but does not exist yet) and hope the data will catch up, with querying being complicated until then, or should we code for the current data to make querying it easy?

The former. We need to give incentives to make proper use of ranks and qualifiers. Their proper use is pretty important for 3rd party users and data quality.

Write code for data that is ranked properly. But even with properly ranked data everything needs to cope with multiple answers that are contradictory. See it as multiple possible realities. Being able to cope with this is designed into Wikidata and thus needs to carry over to something that queries Wikidata in a generic fashion.

Finding the items that can be corrected in an automatic or human-assisted fashion would be a good use case for a WDQ service :). For very clear-cut cases, like non-overlapping dates of population counts with no other qualifiers, it is probably easy to find someone to make a bot run to correct those on wikidata.org, if there were a way to get a list. ( http://wdq.wmflabs.org can not express a query for that.) For the cases where someone wants to write code to be more clever based on qualifiers, that work should be done in a way that leads to correcting the actual data. (Btw. idea added there: https://www.wikidata.org/w/index.php?title=Wikidata:Constraint_violation_report_input&diff=prev&oldid=178922838 ) For cases where one wants a query result to represent a certain view of reality, that cleverness (or at least instructing it) needs to be in the query and not in the service.

OK, this can be done, but the issue here is that we can't evaluate a solution such as Titan/Gremlin (e.g. for performance, fitness to the data, etc.) if we have no data to test it on. Meaning, assume we coded up all the queries under the assumption that the data is ranked properly. But the current data, as far as I can see, is not. So how can we know whether a certain query would be fast or not on real data?

Another issue is that if the cleverness is in the query, we cannot pre-process the data on import. Which means that if we need to do an operation as simple as "select the latest relevant population figure", instead of preparing for it on data import, we'd be doing it every time the query is run, on every query, for every data item. I suspect that may hurt performance, especially if such items are combined. I.e. "get me the mayors of the 10 most populous capitals" - now we have 3 data sets to process: mayors (which may be past or current), capitals (which can be current or old), and population figures. This means that instead of considering one data point at each junction, we will have to consider multiple ones - and this work will be done anew each time. Of course, ideally we'd have all the latest data marked as preferred, but I don't think that is the case now even for relatively common data, and it is probably even worse once the data sets become more exotic and less visible/visited.

Maybe the solution would indeed be to make a bot that just reads qualifiers and automatically sets a preferred rank on the latest figure if there is none, and that bot could perhaps run more complex queries than a regular user.

In any case, I'll add rank support to the importer/query prototype, but I think we still need to consider optimizing for the common case.

This is real data. If the system cannot cope with data changes of the discussed scope, I have the suspicion it won't be able to cope with normal day-to-day changes in the data.

If optimization for specific types of queries on specific properties is necessary, that would constrain the system's usefulness. We should try to avoid that if possible. There are quite a few properties where it is correct to have no preferred and 10 normal-ranked statements (examples from Obama: educated at, award received, website account on, described by source).

I would like to know whether Titan can compute cases like this in a generic way and still be fast enough. If it is, correctly ranked data should result in faster responses. Then wouldn't the current data be enough for evaluation?

To write queries that correctly deal with multiple possibilities, and also if we were to better optimize the data for answering a specific query, the use case needs to be more specific. E.g. for the cities with the biggest population there are multiple ways, once the date qualifier is exhausted, to get the latest number: a) more accurately represent the possibilities (each top spot can have multiple possibilities and cities can appear on multiple spots), b) higher number wins, c) lower number wins.

Anyway, it seems to me that such a query is not an interactive use case. It is more like something to be evaluated in a batch fashion as CPU is available (think Hadoop), with the result saved for viewing and probably updated at some point in the same way. If queries like that are not fast enough to be done in an interactive way, we need better, more specific use cases for interactive queries.

@JanZerebecki We can query against any data, but some queries, I assume, would be more frequent than others. For example, if we have "country population" stored as an array of qualified numbers, then the query "what are the most populous countries" would require scanning each such array for each country. Regular Titan indexes would most probably not support such scans, because they cannot make sense of the complex data structure that the population property now is - with multiple values, qualifiers, preferences, etc. Not sure if Elastic can handle such structures efficiently either, since it would require non-trivial logic to extract indexable values. That's why I am thinking about having a single value - probably in addition to having the full data - which can be used for such queries, indexed, etc.

Of course, you are correct in noting that not every property would have such a unique value - some of them, like "educated at", would not have any dedicated value and as such will always be handled using the full set of data. I think we can still handle such cases (query examples for verification are most welcome). However, I think this should not preclude us from having an optimization for the common case where we do know how to handle it more efficiently. It would work for some cases and be useless for others - so discretion should still be used when writing the actual queries - but at least we would have an option for the fast path.

Yes, it is probably not a good idea to add complex data structures like that.

We need to be clear about the data model / schema we will be using. As a start I quickly sketched a possible mapping from the Wikibase data model to a graph-based form:
item: vertex with edge to statement, edge to sitelink
statement: vertex with exactly one edge to property, exactly one edge to value|item, edge to qualifier, edge to reference
value: vertex
qualifier: vertex with exactly one edge to property, exactly one edge to value|item
reference: vertex with exactly one edge to property, exactly one edge to value|item
sitelink: vertex
property: vertex with edge to statement
(Except where noted, edges can occur zero or multiple times. Key-value pairs on vertices and edges are omitted; those values are only scalars.)
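To illustrate with a concrete statement - taking "Q42 (Douglas Adams) P31 (instance of) Q5 (human)" as an example - the mapping above would give:

Q42 item vertex
  -> statement vertex
       -> property vertex P31 (exactly one)
       -> item vertex Q5 (exactly one)
       -> qualifier and reference vertices (zero or more)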

I also put some more comments to the discussion page https://www.mediawiki.org/wiki/Talk:Wikibase/Indexing/Data_Model - I think it makes sense to discuss/clarify things there, but if anybody thinks there's a better place please tell.

https://www.mediawiki.org/wiki/Wikibase/Indexing has been updated according to the comments. Main change - claims are now vertices.

Doing the import with the new model, I see that the import is significantly slower when claims have their own vertices. Not sure if it's a big deal or not. If it's an issue, we may want to reconsider and go back to the "claims as edges" model.

Maybe worth checking out: http://www.tinkerpop.com/docs/3.0.0.M6/#vertex-properties
Titan 0.9 has TinkerPop 3, which has a significantly expanded property model - in particular, a property can have multiple other properties attached to it, and can itself have multiple values.
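For illustration, that would allow modelling a multi-valued, qualified property directly on the item vertex, roughly like this (a sketch against the in-memory TinkerGraph, using the final TinkerPop 3 package names - the M6 milestone linked above may differ slightly; the property names are made up):

import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.structure.VertexProperty;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class MetaPropertyDemo {
    public static void main(String[] args) {
        Graph graph = TinkerGraph.open();
        Vertex country = graph.addVertex("item");
        // Two "population" values, each carrying its own key/value pairs
        // (a point-in-time and a rank), instead of a separate claim vertex.
        country.property(VertexProperty.Cardinality.list, "population", 318_000_000L,
                "pointInTime", 2014, "rank", "normal");
        country.property(VertexProperty.Cardinality.list, "population", 309_000_000L,
                "pointInTime", 2010, "rank", "normal");
        country.properties("population").forEachRemaining(p ->
                System.out.println(p.value() + " as of " + p.value("pointInTime")));
    }
}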

Re performance and indexing, from a mail thread:
Earlier today Stas & I were looking a bit into what is happening behind the scenes in some of the slower queries like https://www.mediawiki.org/wiki/Wikibase/Indexing/Benchmarks#List_of_humans_having_occupation_writer_but_not_author

We found that in a query that uses predicates on more than a single property per vertex, the default Titan query strategy is to retrieve a list of candidates based on the first index (fast), and then load the details for each vertex in order to filter on the second predicate. This is pretty slow, as it involves O(candidates) queries to Cassandra.

I think mixed indexes as documented in http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed should help here, as they support efficient matching on an arbitrary combination of attributes, along with advanced range and full text queries. They do require an indexing backend like Elasticsearch to be configured. Nik, is there an Elasticsearch setup we could use, or would it be reasonably easy to spin up a new one?

Stas replied:

I've looked into this matter, and there's one wrinkle: mixed indexes currently don't support Date and SET fields, which means we cannot index everything - some fields (among them the frequently used date start/end fields and the P*link fields, which are SET) would not be indexable with Elastic. They say it's a "temporary limitation", but so far 0.9 still has it.

Dates are actually interesting regarding the ranges we need to represent. A signed long millisecond-resolution unix timestamp (as used by Java Date) reaches from 292269055 BC to 292278994 AD. Not quite enough to represent the current estimates for the Big Bang, for example. ISO 8601 timestamps support much less range, even in the six-digit extended mode. Maybe it makes sense to index years separately as a long? This will also be fun in JSON.

For SET I could imagine a full-text index on a string with each P*link property separated by a space. Elasticsearch should be good at indexing that. Not sure how much more complex that would be to maintain compared to the current SET index.

For SET it won't be more complex to maintain, probably, but I'm not sure if the lookups would be fast enough. I could create an additional field for that and see how it behaves, and then we could drop the field that is not needed.

For Date, I wonder if support can't be added to Titan, since Elastic AFAIK supports dates.

Also, while Java can support negative years, right now the import does not support them, since the Java parser fails to parse them properly. If we're moving to Java 8, there are better date APIs AFAIR, so maybe they allow handling them properly.

I think mixed indexes as documented in http://s3.thinkaurelius.com/docs/titan/current/indexes.html#index-mixed should help here, as they support efficient matching on an arbitrary combination of attributes, along with advanced range and full text queries. They do require an indexing backend like Elasticsearch to be configured. Nik, is there an Elasticsearch setup we could use, or would it be reasonably easy to spin up a new one?

I'd spin up a new one - probably just on a single node. I think in the long run we can probably run this on the production search cluster, but for now let's keep it off, just in case it does something stupid. I can put together some puppet changes to put a single-node elasticsearch instance on einsteinium.

For Date, I wonder if support can't be added to Titan, since Elastic AFAIK supports dates.

It sure does. They are parsed and formatted automatically but amount to a java long since epoch under the hood. As @GWicke said that means they can't reach back until the big bang. If instead of dealing in dates we dealt in _seconds_ since epoch we could reach back to the big bang so long as current estimates are right to within an order of magnitude. Instead of 292 million years ago we'd have 292 billion years ago.

Also, while Java can support negative years, right now the import does not support them, since the Java parser fails to parse them properly. If we're moving to Java 8, there are better date APIs AFAIR, so maybe they allow handling them properly.

Elasticsearch is based on Joda Time which can handle negative years just fine. It can't handle negative years that far back though. I've filed an issue for it but I imagine we'll be on our own. I believe they are getting infinite precision numbers at some point but the kind of dates we handle are probably best stored in floating point instead.

For SET it won't be more complex to maintain, probably, but I'm not sure if the lookups would be fast enough. I could create an additional field for that and see how it behaves, and then we could drop the field that is not needed.

Elasticsearch totally supports sets. Like, so supports. We use them for stuff like hastemplate and incategory. Have a look at the dump for a page. Lucene has native support for sets of strings, sorta. It's a very leaky abstraction around pretending that they are one big string with junk tokens between them. The junk tokens prevent phrase searches from spanning multiple values in the set.

Looking at mixed indexes, I wonder how they are backed by Elasticsearch. Lucene/Elasticsearch pretty much indexes everything independently and then ANDs the results of traversing multiple indexes together to get the answer. That deserves some looking into.

I'd spin up a new one - probably just on a single node. I think in the long run we can probably run this on the production search cluster, but for now let's keep it off, just in case it does something stupid. I can put together some puppet changes to put a single-node elasticsearch instance on einsteinium.

Awesome, that would be great! I think storage is not very fast there (disks IIRC), but maybe the 64G of memory will compensate.

For Date, I wonder if support can't be added to Titan, since Elastic AFAIK supports dates.

It sure does. They are parsed and formatted automatically but amount to a java long since epoch under the hood. As @GWicke said that means they can't reach back until the big bang. If instead of dealing in dates we dealt in _seconds_ since epoch we could reach back to the big bang so long as current estimates are right to within an order of magnitude. Instead of 292 million years ago we'd have 292 billion years ago.

Seconds wouldn't actually work, but years will, especially if represented as a double. The big bang represented as years comfortably fits into the 52-bit mantissa of a double, so we wouldn't get weird rounding artifacts like reading back 13.834520923424352345234523 billion years after storing 13.8. But we'd have plenty of range for other cosmic dates.
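A quick sanity check of those ranges (plain Java, arithmetic only):

public class RangeCheck {
    public static void main(String[] args) {
        long secondsPerYear = 31_556_952L;                            // average Gregorian year
        // Signed long seconds since epoch covers roughly +/- 292 bln years.
        System.out.println(Long.MAX_VALUE / secondsPerYear);         // ~292 billion
        // ~13.8 bln years needs ~34 bits, well inside a double's 52-bit mantissa.
        double bigBangYears = 13_800_000_000.0;
        System.out.println((long) bigBangYears == 13_800_000_000L);  // true: no rounding
    }
}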

Elasticsearch is based on Joda Time which can handle negative years just fine. It can't handle negative years that far back though. I've filed an issue for it but I imagine we'll be on our own. I believe they are getting infinite precision numbers at some point but the kind of dates we handle are probably best stored in floating point instead.

Parsing out the year portion of an ISO 8601 timestamp in the API frontend is not all that hard, so we could just go ahead and index on a double representation of that in Titan. For dates within the long cut-off, we can also convert that to long (or double), and store that in a finer-grained index. We'll then need to switch between year-based indexing and finer-grained indexing in the front-end after looking at a string representation of a timestamp in the query.
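A minimal sketch of that year extraction (assuming Wikidata-style timestamp strings like "+00000001984-01-01T00:00:00Z"; the helper name is made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TimestampYear {
    private static final Pattern YEAR = Pattern.compile("^([+-]?\\d+)-");

    // Extract the (possibly huge or negative) year as a double for indexing.
    static double yearOf(String timestamp) {
        Matcher m = YEAR.matcher(timestamp);
        if (!m.find()) {
            throw new IllegalArgumentException("Not a timestamp: " + timestamp);
        }
        return Double.parseDouble(m.group(1));
    }
}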

Looking at mixed indexes, I wonder how they are backed by Elasticsearch. Lucene/Elasticsearch pretty much indexes everything independently and then ANDs the results of traversing multiple indexes together to get the answer. That deserves some looking into.

The documentation says that it efficiently supports multi-predicate queries on the same index, so I assume that it sends all predicates off to Elasticsearch at once and lets it deal with ANDing them efficiently.

Elasticsearch totally supports sets.

Right, but Titan unfortunately doesn't support mixed indexes on SET properties. I would assume it's not a hard limitation, but rather that they haven't gotten around to implementing it yet. Mixed index type support is very limited right now - they don't support booleans, for example.
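For reference, a mixed index is defined through the Titan management API, roughly like this (following the index documentation linked above, 0.5-series API; "search" is assumed to be the name of the configured Elasticsearch indexing backend, and the property key is just an example):

import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import com.tinkerpop.blueprints.Vertex;

public class DefineMixedIndex {
    public static void main(String[] args) {
        // Assumes a graph configured with Cassandra storage and an
        // Elasticsearch indexing backend named "search".
        TitanGraph graph = TitanFactory.open("titan-cassandra-es.properties");
        TitanManagement mgmt = graph.getManagementSystem();
        PropertyKey labelEn = mgmt.makePropertyKey("labelEn").dataType(String.class).make();
        // Mixed index backed by Elasticsearch: supports multi-predicate, range and text queries.
        mgmt.buildIndex("byLabelEn", Vertex.class).addKey(labelEn).buildMixedIndex("search");
        mgmt.commit();
        graph.shutdown();
    }
}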

Proposed storage format for dates:

  1. Dates are stored as long signed integers, representing the number of seconds since 1970-01-01 00:00:00 UTC.
  2. This gives us a range of about 292 bln years.
  3. When parsing a date, if the year is below 292000000 (by absolute value) and positive, we use the Java date functions and hope they handle the calendar correctly, etc.
  4. Beyond year 292M, or for negative dates, we just count in whole years - i.e. the number stored is years*SECONDS_IN_YEAR.
  5. Negative dates are handled as whole years, since I see no sane choice of calendar to apply to such dates.

This allows us to keep reasonable accuracy (down to the second) within a wide range of dates, while still preserving ordering - i.e. the Big Bang can be listed as an event that happened before the year 2000 - and keeping a single property and data format, so as to avoid complicating the data handling code with special cases.

We would probably want to define helper functions to convert from dates to seconds and back, for convenience.
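A rough Java sketch of such helpers (the names and the exact SECONDS_IN_YEAR constant are illustrative, not settled):

import java.util.Calendar;
import java.util.TimeZone;

class WikidataDateCodec {
    static final long SECONDS_IN_YEAR = 31556952L;   // average Gregorian year, as an example value
    static final long YEAR_CUTOFF = 292000000L;

    // Encode a date as signed seconds since 1970-01-01 00:00:00 UTC, per the proposal above.
    static long toSeconds(long year, int month, int day) {
        if (year > 0 && year < YEAR_CUTOFF) {
            // Normal range: let the Java calendar handle month lengths, leap years, etc.
            Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
            cal.clear();
            cal.set((int) year, month - 1, day);
            return cal.getTimeInMillis() / 1000L;
        }
        // Negative years, or years beyond the cut-off: whole years only.
        return year * SECONDS_IN_YEAR;
    }

    // Decode back to a year (whole years only outside the normal range).
    // Whether the transition at the cut-off is exactly monotonic still needs checking.
    static long toYear(long seconds) {
        long approxYear = seconds / SECONDS_IN_YEAR;
        if (approxYear > 0 && approxYear < YEAR_CUTOFF) {
            Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
            cal.setTimeInMillis(seconds * 1000L);
            return cal.get(Calendar.YEAR);
        }
        return approxYear;
    }
}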

Proposed storage format for dates:

  1. Dates are stored as long signed integers, representing the number of seconds since 1970-01-01 00:00:00 UTC.
  2. This gives us a range of about 292 bln years.
  3. When parsing a date, if the year is below 292000000 (by absolute value) we use the Java date functions and hope they handle the calendar correctly, etc.
  4. Beyond year 292M, we just count in whole years - i.e. the number stored is years*SECONDS_IN_YEAR.

We should double-check that this transition is actually monotonic. If in doubt, we can use the timestamp for year 292M according to Java + (year - 292M) * SECONDS_IN_YEAR.

Overall, I think this is a nice solution. Avoids the complexity of two separate indexes while still providing second resolution across the entire range.

@GWicke Do we really need per-second precision when transitioning between year 292M and 292M+1? I'm not sure it is ever required. We should ensure, of course, that seconds(292M-12-31T23:59:59) < seconds(292M+1), and the same for low dates, but beyond that I'm not sure anything more precise would be of any use, as I can imagine no calendar dates with such years that would make any sense.

seconds(292M-12-31T23:59:59) < seconds(292M+1)

It's pretty likely that there were more than 365 leap years between -292M and 1970, so it's very possible for the years to be non-monotonic if we don't use the Java value (which should include leap years) as the base.

For dates beyond the real Gregorian calendar, values more precise than years have little meaning anyway, so I don't think it matters too much as long as comparisons and lookups (e.g. which Greek philosopher was born in 427 BCE) work.

For dates beyond the real Gregorian calendar, values more precise than years have little meaning anyway, so I don't think it matters too much as long as comparisons and lookups (e.g. which Greek philosopher was born in 427 BCE) work.

Agreed. But we need monotonicity at least at the year level. Without that we also don't have bijectivity.

Smalyshev triaged this task as Medium priority. (Dec 30 2014, 6:12 AM)