Page MenuHomePhabricator

Define which data the query service would store
Closed, ResolvedPublic

Description

Currently, the query service import process only imports the data listed here: https://www.mediawiki.org/wiki/Wikibase/Indexing/Data_Model

For example, for items it only imports labels and claims. Do we need to import any more data? If so, what exactly?

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to Medium.
Smalyshev updated the task description. (Show Details)
Smalyshev moved this task from Incoming to Needs Review/Feedback on the Wikidata-Query-Service board.
Smalyshev added a subscriber: Smalyshev.

Here we would like some feedback from wikidata team.

Lydia_Pintscher added a subscriber: Lydia_Pintscher.

Things I can think of right now - maybe more:

  • definitely should also import aliases. (Might already but not in the description here so making sure.)
  • probably also should import sitelinks. There were a few useful searches community members did with them for cleanups and statistical purposes like finding the wikis with highest percentage of articles in a specific area
  • badges

Things I assume are covered:

  • references
  • sitelinks
  • labels, aliases and statements for properties

That leaves descriptions. I don't think we need to index them?

Only things explicitly mentioned in the wiki are covered now. So, qualifiers are covered, but references, sitelinks, badges and aliases are not.

I'd like to hear more what lookups would we do against sitelinks, aliases, badges and references - do we have any existing examples or ideas of queries?

Some examples:
References: "Give me all statements referenced to Nature" or "Give me all cities in Germany over 2 Million inhabitants based on the national statistics office"
Aliases: The same as labels
Sitelinks: "Give me all people that have an article on Malayalam Wikipedia"
Badges: "Give me all people that have an excellent article on English Wikipedia"

I would like to turn it around. We should support indexing everything:

  • Labels
  • Aliases
  • Descriptions
  • Sitelinks
    • Including badges
  • Statements
    • Including ranks etc
    • Including qualifiers
    • Including references

(probably forgot something)

We should have a good reason to not index something. The fact that we're not creative enough to make up queries for everything doesn't mean it isn't useful.

I would like to turn it around. We should support indexing everything:

...

The fact that we're not creative enough to make up queries for everything doesn't mean it isn't useful.

I have to disagree with this approach of designing a (query) system. Of course supporting "everything" would be nice, but indexing always needs to be viewed in the context of a query language. Depending on the query features you support, "indexing" may mean completely different things. The requirement that "everything should be indexed" is fuzzy and vague.

For example, Wikidata Query supports regular path queries (in Wikidata Query, Kleene-star recursion is accessed with an operator called "TREE"). When you say that references should be indexed, do you mean that it should be possible to navigate through references within regular path expressions? How about qualifiers? I think Wikidata query only supports TREE in the main statements. On the other hand, Wikidata query does not support any kind of cyclic queries ("Find people whose parents are married to each other.") even though the relevant data is indexed. The point is that you need adequate index structures for each query feature. You may have "indexed everything" and still not be able to do the queries you want.

Moreover, I am creative enough to come up with queries that would be very hard to implement, yet this does not mean that we should do it. Both "everything" and "everything we are creative enough for" are poor design principles.

So how to move forward? The usual approach to design a practical system is to collect use cases and requirements (example queries) and then make a clear decision what should be supported and what shouldn't. One can always revise this decision later, but it's still much better to make it explicit than to just go along and see what we get. In all of this, it must be understood that supporting "everything" is an impossible task. The question is merely where to draw the line(s).

On another note, it would be good if the view on all data that is in the system is somewhat uniform. We don't want special query syntax for badges etc. This could be avoided by viewing everything as statements, possibly using some "special" properties (and qualifiers). For example, a label can be structurally represented like a statement for a property of type monolingual text. A sitelink could be represented as a statement with main property pointing to the article title and qualifiers defining the site and badges. Doing this does not change the data, but it would unify the query syntax.

Something that should *not* be represented in the query service is order. The order of statements relative to other statements (of the same property or not), the order of qualifiers, the order of reference snaks in a reference, the order of aliases -- none of this should play a role in query answering as such. Whan displaying data, it could still use the order as in Wikidata, but this is not related to indexing (e.g., it should not be possible to search for entities with third alias starting with "a" and the fifth statement from the top being about property P31).

I understood "indexing everything" to mean that at least support answering a query for it with an exact value and traversing to things it is connected to without iterating over anything else. (Technically for Titan a composite index over one key, see http://s3.thinkaurelius.com/docs/titan/current/indexes.html#_composite_index .) I.e. support finding badges "Featured Article" and also support traversing to the site link one is connected to and then which item that is connected to. So this does not imply any fancy indices (like over functions) nor even full text, prefix or range indices.
Does it sound OK to have this as a baseline?

If there are no other indices that means e.g. that we can answer "Find people whose parents are married to each other." only by looking at every human that is married (which so far is indexed by exact value for each of the lookups on its own; i.e. no combined index for married humans) and then traversing to their children where both parents are in the previous set of humans (which is AFAIK currently not helped by any index).

A question about qualifiers: currently if there are two qualifiers with the same property on the same statement, it will be split when importing into Titan like as if it were two statements, is this something that is acceptable or should we correct this now?

@Multichill I think there is a very good reason not to index everything. The reason is that everything has costs. If we had a system with zero index cost and infinite speed, of course, no problem. But in any real system creating indexes costs, storing indexes costs and retrieving indexes costs, and these costs grow, often in non-linear fashion, the more data we put into the system. Thus, there are very good reasons not to index something that we do not need indexed, and not to store something we do not need stored - it would hinder the performance of the things we do need indexed and looked up quickly.

@mkroetzsch Are you sure you want to unify query syntax this way? Claims are different from sitelinks - you can ask "give me sitelinks with "Featured" badge" but you can not ask "give me claims with "Featured" badge" because claims do not have badges. On the other hand, you can ask "give me claims that refer to the 20th century" but you can not ask "give me sitelinks that refer to 20th century" - because claims have qualifiers and sitelinks do not. So I am not sure it makes a lot of sense to unify them. With labels, I could represent them as claims, but that would be less efficient, and would not really improve much, since you can not really apply to labels any complex queries you can apply to claims (qualifiers, trees, references) so you do not win anything representing them as claims. Moreover, the function of labels right now is more for display (it's really hard to understand results as Q30 or Q33, "USA" and "Finland" are much better) than for anything else.

@JanZerebecki what you mean by "support traversing to the site link one is connected to and then which item that is connected to"? If we implement sitelinks and badges, the ability of looking up for specific badge value will be present, though looking up all "Featured Article" owners may be expensive if there's a lot of them. I'm not sure I understand the traversing part.

currently if there are two qualifiers with the same property on the same statement, it will be split when importing into Titan like as if it were two statements, is this something that is acceptable or should we correct this now

Yes, this is true, but this is a technical detail of indexing. The reason for this is that indexes need simple values to index, so if we have three values for P123 in the claim, we can not put them into one variable since this could not be consumed by the index. The most simple and efficient way is to make it really three values and let the engine index each of them. There are other solutions possible but they are more cumbersome to work with, so I am inclined to stay with this one unless there's a problem identified with it that would not allow to use it.

As a side note, this also allows proper grouping of paired properties like start date/end date, which are really meaningless if they are stored together as one bag and not as a set of pairs.

@Smalyshev My suggestion was just about the surface appearance, not about the inner workings. I am saying that the following two phrases have the same structure:

  • "Find things with a *sitelink* that *has badge* *featured*".
  • "Find things with a *population* that has *point in time* *2014*".

If you look at it like this, you use "badge" like a (special) qualifier property and "featured" like a value. This does not mean that the query answering will be any less efficient than with another syntax. The query engine would easily parse the queries in any case, without ambiguity, and know that sitelinks are (possibly) stored in a different way internally. Same for labels. The reason I was suggesting this unification here was that it also somehow answers the question "what do we mean by 'indexing' this data?": any query that would work over statements would also be expected to work for sitelinks, even if different structures are used internally.

@mkroetzsch the issue is that "point in time" is an official property in wikidata, which can be looked up and associated with property P585. However, "has badge" is not a property and as such can not be expressed in these terms. Of course, we can make a language that would have some logic - either database-backed or just whitelist-based - that would identify that "has badge" is a special case and create special translation for it. That would complicate the engine, but it's not the main issue. I wonder if it is really what the expectations of the users would be. After all, we display links and badges in different form than claims in the main wikidata UI - wouldn't the users of the same UI expect different handling of the sitelinks and badges in the query API too?

any query that would work over statements would also be expected to work for sitelinks

I'm not sure we want to make that claim, because I don't think it is true. The data model is different for sitelinks as opposed to claims - claims have values, qualifiers and references, while sitelinks have badges. In fact, there's very little in common between the two of them except for the fact that they are both attached to items. So I am concerned that claiming that would be rather confusing.

@Smalyshev My point is merely that sitelinks and labels can be handled like statements. Since statements must be supported anyway, it would be sensible to reuse the data structures and query expressions defined for them. I don't think that confusion is likely, since the query language will not use the colloquial names as my examples. Properties of Wikidata will always be referred to by their Pid, whereas something like "has badge" would not have an id of this form. So it's not like having a reserved label "has badge" that competes with Wikidata property labels.

Structurally, however, statements and sitelinks can all be represented in the same data structure as far as querying is concerned. Maybe you would like it better if you viewed it as a separate, independent data structure "qualified triple" that we would use to represent statements, sitelinks, and labels? There is no conceptual mix-up between the high-level terms here, just taking advantage of similar structure at a low level. If you look at common query languages like SQL, SPARQL, Cypher, etc. then you can see that they are always based on a relatively small set of structural primitives that do not have a domain-specific meaning. You can always build UIs that use domain specific terms like "sitelink" and that make them appear separate, but for implementers and API users it is very useful if some things can be unified.

@JanZerebecki I understand what you are saying about what "indexing" means here. Makes sense to me. What you are saying about my example query sounds as if you are planning to implement query execution manually. I hope this is not the case and you can just give the query to Titan to get it answered for you.

You mentioned splitting certain statements with duplicate qualifiers. This changes the structure, and the original structure is no longer represented and can no longer be faithfully recovered. I don't know if this is an issue with the current data (which duplicate qualifiers are actually used?) but it is an issue in general. It also means that with a mere 40 qualifiers (20 properties, each with two values) on one forged statement, I could create a million statements in your index -- a possible DOS attack vector?

An alternative technique for handling such cases is to use a special encoding for overflowing values. This is done successfully in some database implementations, but it may mean additional checks during query answering, and depending on how low in the query answering process you can do these checks, they may or may not add significant cost (hence it's a pity that Titan does not support this efficiently out of the box).

This changes the structure, and the original structure is no longer represented and can no longer be faithfully recovered.

This is not correct, original structure can be recovered, though I see no reason why would you want to do so. Can you name one?

a mere 40 qualifiers (20 properties, each with two values) on one forged statement, I could create a million statements in your index -- a possible DOS attack vector

Scan of the database shows there are no entries generating more than 15 qualifier splits (at least I couldn't find any a month ago). If it ever becomes a problem, we could easily institute limits, but I think vandalism is better handled on other levels than changing our data model to avoid vandals. I see no case where 20 duplicate qualifiers would legitimately be required - duplicate qualifier is usually a wrong way to represent the claim, since it essentially claims that the same event happened in two places, two times, etc. - which usually means what should there be is two separate claims, as event happening two times is two different instances of the event. So I would claim even most existing duplicates look like data errors.

This comment was removed by Smalyshev.

This is not correct, original structure can be recovered

Then I misunderstood the transformation that was proposed. My impression was that a statement with three qualifier snaks: P1 V1, P1 V2, P2 V3 would be stored as two statements, one with qualifiers P1 V1, P2 V3, and one with qualifiers P1 V2, P2 V3. In this case, one would not be able to distinguish this from the case where two statements with two qualifiers each had been given originally. Could you explain what kind of transformation you had in mind?

Scan of the database shows there are no entries generating more than 15 qualifier splits

My point was that an attacker could craft a single statement that makes you index millions of statements. It's clear that such statements are not in the current data, since they are hopefully not needed. Again this depends on my (possibly incorrect) understanding of your intended transformation.

In this case, one would not be able to distinguish this from the case where two statements with two qualifiers each had been given originally

It is possible to distinguish them since claim IDs are recorded too for bookkeeping, so the split claim would have same IDs while different claims would have different IDs. I'm still not sure why this distinction is important though.

My point was that an attacker could craft a single statement that makes you index millions of statements.

It is easy to introduce limits if this would be of any concern. Since our data does not have any large numbers, limiting expansion factor by, say, 50 or so would not impact the system and would prevent such problems.