Figure out quantity representation
Closed, Resolved, Public

Description

Right now we represent quantity as a string, copied as-is from the JSON dump. This is inefficient because it cannot be indexed properly. We should consider representing it as Long, Double, or some other type that allows proper indexing.

This probably needs more guidance from Wikidata team as to what "quantity" values may actually contain.

Event Timeline

Smalyshev claimed this task.
Smalyshev raised the priority of this task from to Medium.
Smalyshev updated the task description.

Judging from the WDQ source, WDQ treats it as a float. So maybe double precision is enough?

I think I'll decide to represent them as Double for now. If anybody objects, we can reopen and change it.

Turns out there is a complication: Titan cannot use floats (including doubles) in vertex-centric indices: http://s3.thinkaurelius.com/docs/titan/0.5.2/common-questions.html#_floating_point_numbers_in_vertex_centric_indices

We need to figure out the consequences of this, so I'm leaving it open for now.

Float would be ok, double would be better. Either should work for most applications, but may cause incorrect results for very small or very large numbers (say, mega-lightyears measured in meters). I think this is acceptable, at least for now.
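To make the precision concern concrete, here is a minimal sketch that stores one mega-lightyear in meters as a 32-bit float and as a 64-bit double and measures how far each lands from the exact integer value. The light-year length used is the standard definition (299792458 m/s times one Julian year); the helper name `to_float32` is made up for this illustration.

```python
import struct


def to_float32(x: float) -> float:
    """Round-trip a Python float (a double) through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# One light-year is exactly 9,460,730,472,580,800 m; a mega-lightyear is 10^6 of those.
MEGA_LIGHT_YEAR_M = 9_460_730_472_580_800 * 1_000_000  # exact Python int

as_double = float(MEGA_LIGHT_YEAR_M)
as_float = to_float32(as_double)

# Absolute error in meters, computed with exact integer arithmetic.
double_error = abs(int(as_double) - MEGA_LIGHT_YEAR_M)
float_error = abs(int(as_float) - MEGA_LIGHT_YEAR_M)

print(f"double is off by {double_error} m")
print(f"float  is off by {float_error} m")

# At this magnitude a double cannot even register a one-meter difference.
print("adding 1 m changes the double:", as_double + 1.0 != as_double)
```

Both representations lose something at this scale, but the float loses many orders of magnitude more, which is in line with "float would be ok, double would be better".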

But Titan not allowing floats or doubles for indexing is bad. The "Precision" type (6 decimal digits) is probably our best option, but it will already fail for anything measured in nanometers or micrograms.
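A rough illustration of the nanometer problem, assuming "6 decimal digits" means six digits after the decimal point. This uses Python's `decimal` module to imitate such a fixed-point type; it does not call Titan, and `to_six_places` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Assumption: the fixed-point type keeps exactly six digits after the decimal point.
SIX_PLACES = Decimal("0.000001")


def to_six_places(value: str) -> Decimal:
    """Round a decimal string to six fractional digits, as a fixed-point type would."""
    return Decimal(value).quantize(SIX_PLACES, rounding=ROUND_HALF_EVEN)


# 5 nanometers expressed in meters collapses to zero.
print(to_six_places("0.000000005"))

# Ordinary values survive, with the tail rounded away.
print(to_six_places("1.2345678"))
```

Anything smaller than half a micro-unit becomes indistinguishable from zero, so such values could not be ordered or range-queried at all.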

Titan allows indexing floats, just not in vertex-centric indexes. Elasticsearch indexes support floats, for example. I'm not sure yet what the actual impact of this limitation is; it probably depends on the kind of lookups we would do, since vertex-centric indexes may be important for some and irrelevant for others. We'll need more research into this. If this proves to be a problem, we could have a workaround: say, a duplicate property with the Precision data type, at lower resolution, used just for specific lookups that benefit from indexing.
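The duplicate-property workaround could look roughly like the sketch below: keep the full-resolution value, derive a coarse integer key from it for the index, and re-check candidates against the full value. This is plain Python with a sorted list standing in for the index; the item IDs, the `SCALE` resolution, and the function names are all assumptions for illustration, not Titan API.

```python
from bisect import bisect_left, bisect_right

SCALE = 10**6  # assumed resolution of the coarse, indexable copy


def index_key(value: float) -> int:
    """Coarse integer key derived from the full-resolution value."""
    return round(value * SCALE)


# Hypothetical data: item id -> full-resolution quantity.
quantities = {"Q1": 0.5, "Q2": 1.25, "Q3": 1.2500004, "Q4": 3.0}

# The "index": (coarse key, item id) pairs kept in sorted order.
index = sorted((index_key(v), item) for item, v in quantities.items())
keys = [key for key, _ in index]


def between(low: float, high: float) -> list:
    """Range lookup: narrow by coarse key, then re-check the full value."""
    start = bisect_left(keys, index_key(low))
    stop = bisect_right(keys, index_key(high))
    candidates = (item for _, item in index[start:stop])
    return sorted(i for i in candidates if low <= quantities[i] <= high)


print(between(1.0, 1.25))
```

Here Q2 and Q3 share the same coarse key, so the index alone cannot tell them apart; the final filter on the full value is what keeps the result exact.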

In any case it's probably not worse than keeping them as strings, since with strings indexes cannot be used for queries like "more", "less", "between", etc., and may not even work for exact value matches.
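A small example of why string values break ordering queries. The amounts are written with a leading sign, the way quantity amounts appear in the JSON dump; the particular values are made up.

```python
# Quantity amounts as strings (sample values are invented).
amounts = ["+9", "+10", "+100"]

# Lexicographic order compares character by character, so "+10" sorts before "+9".
as_strings = sorted(amounts)

# Numeric order after parsing.
as_numbers = sorted(amounts, key=float)

print(as_strings)
print(as_numbers)

# Exact matching on strings is fragile too: same number, different text.
print("+1.0" == "+1", float("+1.0") == float("+1"))
```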

No, I don't think unit localization matters for any WDQ service, as it is only about how to format them (though it might for a fancy query builder or something that displays search results to end users). However, the units themselves (T77977) matter. Until now we only had quantities with unit 1. Going forward we will have quantities with units that are represented on wikidata.org by items (similar to coordinate globes).

@JanZerebecki right, so we need to figure out how to represent this efficiently, because indexes cannot work with values expressed in different units. Right now the engine assumes all quantities are homogeneous, but if they are not, how would one compare them? I'm not sure it's even possible automatically: if somebody specified the weight in ounces, there are like 11 kinds of them, and even if we could rely on all conversion values being accurately represented in the database (which I'm not at all sure we can), how would we know which unit converts to which? There would then have to be some hierarchy defined, with standard (SI?) units for quantities, and each non-standard unit would need a conversion procedure to the standard one. The procedure also may not be multiplication alone, e.g. the Celsius<->Fahrenheit conversion. So this opens a pretty big can of worms with regard to automatic processing.
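One possible shape for such a conversion definition, as a sketch only: each unit names its standard unit plus a factor and an offset, which covers both plain multiplication (ounces) and offset-based conversions (temperature). The table, the unit labels, and the `Conversion` class are hypothetical; nothing like this is assumed to exist in the data yet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversion:
    """standard = (value + offset) * factor"""
    standard_unit: str
    factor: float
    offset: float = 0.0  # added before scaling; zero for purely multiplicative units

    def to_standard(self, value: float) -> float:
        return (value + self.offset) * self.factor


# Hypothetical conversion table keyed by unit label.
CONVERSIONS = {
    "kilogram": Conversion("kilogram", 1.0),
    "ounce (avoirdupois)": Conversion("kilogram", 0.028349523125),
    "troy ounce": Conversion("kilogram", 0.0311034768),
    "kelvin": Conversion("kelvin", 1.0),
    "degree Celsius": Conversion("kelvin", 1.0, 273.15),
    "degree Fahrenheit": Conversion("kelvin", 5.0 / 9.0, 459.67),
}


def to_standard(value: float, unit: str) -> tuple:
    """Return (converted value, standard unit) for a known unit."""
    conv = CONVERSIONS[unit]
    return conv.to_standard(value), conv.standard_unit


print(to_standard(1.0, "troy ounce"))
print(to_standard(1.0, "ounce (avoirdupois)"))
print(to_standard(32.0, "degree Fahrenheit"))
```

Two kinds of "ounce" already give different kilogram values, which is the ambiguity described above: the unit has to be identified precisely before any conversion makes sense.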

I think the idea is to do the conversion when the data gets into the query index and then query only on homogeneous data.

@Lydia_Pintscher sounds good, but that means every unit should have a definition of which unit it converts to and how to convert it. Maybe as part of T77978.

Yeah, agreed. And for the units where we don't have conversions, we can only offer querying them as they are anyway.
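A sketch of what conversion at index time with an as-is fallback might look like. The field names, the conversion table, and `index_entry` are hypothetical; the point is only that the normalized fields exist when a conversion is known and are simply absent otherwise.

```python
# Hypothetical table: unit label -> (standard unit, multiplication factor).
TO_STANDARD = {
    "gram": ("kilogram", 0.001),
    "kilogram": ("kilogram", 1.0),
}


def index_entry(amount: str, unit: str) -> dict:
    """Build the record stored in the query index for one quantity value."""
    value = float(amount)
    # The original value and unit are always kept and queryable as they are.
    entry = {"raw_value": value, "raw_unit": unit}
    if unit in TO_STANDARD:
        standard_unit, factor = TO_STANDARD[unit]
        # Only units with a known conversion get the homogeneous, comparable fields.
        entry["normalized_value"] = value * factor
        entry["normalized_unit"] = standard_unit
    return entry


print(index_entry("+500", "gram"))
print(index_entry("+3", "some unit without a conversion"))
```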

Also we probably need to get to implementing value ranges/precision eventually.