
Determine Hive format for Metrics Platform bespoke data
Closed, Resolved · Public

Description

I want to present some of the options we have for storage of a bespoke data structure in a Hive table for discussion.

Key goals for us are:

  • Simple to query
  • Flexibility for upstream instruments to add/remove/change fields without the need for a database migration
    • In particular, this means that the storage format should be fixed, so that the type of a data value can change without changing the column type
  • As much type-safety and performance as we can get without compromising the above

After working through alternatives, here are the options I was able to come up with. In these examples, the bespoke structure, as it appears in JSON, holds multiple pieces of non-standard data under a single field called bespoke. Each option below is labeled with its Hive type spec, followed by how the structure would look on the wire in JSON.

Option A
bespoke STRING

bespoke: "{ lucky_number: 1, message: 'hello, world!' }"

The bespoke field's value is a string, specifically a JSON string, which can be parsed at query time using get_json_object() or json_tuple().
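
For example, a query could look like the following sketch; the table name events is hypothetical, and note that both functions return strings, so numeric values still need a cast.

SELECT get_json_object(bespoke, '$.lucky_number') AS lucky_number
FROM events;

-- Or, extracting several fields in one pass:
SELECT t.lucky_number, t.message
FROM events
LATERAL VIEW json_tuple(bespoke, 'lucky_number', 'message') t
  AS lucky_number, message;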

Option B
bespoke MAP<string, string>

bespoke: { 
  lucky_number: '1', 
  message: 'hello, world!' 
}

The bespoke field's value is an object with string keys and string values. Even values whose types are not strings must be converted to strings before being sent.
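
Querying is then a plain map lookup, with an explicit cast wherever the original value was not a string. A sketch (the table name events is hypothetical):

SELECT
  CAST(bespoke['lucky_number'] AS INT) AS lucky_number,  -- stored as '1'
  bespoke['message']                   AS message
FROM events;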

Option C
bespoke MAP<string, STRUCT<type: string, value: string>>

bespoke: { 
  lucky_number: { 
    type: 'integer', 
    value: '1'
  }, 
  message: {
    type: 'string', 
    value: 'hello, world!'
  } 
}

Rather than force all values to be strings, we allow query-time casting of values by using a type hint. Not ideal since we don't want to be casting at query time if we can help it.
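
A sketch of what that query-time casting could look like, using the type hint to pick the cast (hypothetical table events):

SELECT
  CASE bespoke['lucky_number'].type
    WHEN 'integer' THEN CAST(bespoke['lucky_number'].value AS INT)
  END AS lucky_number
FROM events;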

Option D
bespoke MAP<string, UNIONTYPE<integer, float, string, boolean>>

bespoke: { 
  lucky_number: 1, 
  message: 'hello, world!' 
}

This accommodates the flexible use of JSON with primitive types, allows the types to change, and is type-safe. It does not support complex types unless they are encoded as strings.
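
As a minimal DDL sketch, using Hive's spelling of the primitive types (the table name events is hypothetical); note that Hive's query-side support for UNIONTYPE has historically been limited, which becomes relevant in the discussion below:

CREATE TABLE events (
  bespoke MAP<STRING, UNIONTYPE<INT, FLOAT, STRING, BOOLEAN>>
);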

Option E
bespoke MAP<string, UNIONTYPE<integer, float, string, boolean, ARRAY<integer>, ARRAY<float>, ARRAY<string>, ARRAY<boolean>>>

bespoke: { 
  lucky_number: 1, 
  message: 'hello, world!', 
  interval: [0, 1]
}

This accommodates everything in Option D, and also allows arrays of fixed primitive types.

Option F
bespoke MAP<string, UNIONTYPE<integer, float, string, boolean, ARRAY<UNIONTYPE<integer, float, string, boolean>>>>

bespoke: { 
  lucky_number: 1, 
  message: 'hello, world!', 
  interval: [0, 'wow']
}

This accommodates everything in Option E, but allows the arrays to be of mixed primitive types.

Option G
bespoke MAP<string, UNIONTYPE<integer, float, string, boolean, ARRAY<UNIONTYPE<integer, float, string, boolean>>, MAP<string, UNIONTYPE<integer, float, string, boolean, ARRAY<UNIONTYPE<integer, float, string, boolean>>>>>>

bespoke: { 
  lucky_number: 1, 
  message: 'hello, world!', 
  interval: [0, 'wow'],
  notes: { 
    color: 'blue', 
    length: 50 
  } 
}

This accommodates everything in Option F, but also allows for objects mapping strings to primitive types (i.e. only one level of nesting).

Option H
bespoke ARRAY<STRUCT<name: string, type: string, value: string>>

bespoke: [
  {
     name: 'lucky_number',
     type: 'integer',
     value: '1'
  },
  {
     name: 'message',
     type: 'string',
     value: 'Hello, world!'
   }
]

This approach uses ARRAY instead of MAP as the outermost complex type. This has the benefit of avoiding collisions, but makes queries more difficult since dealing with arrays is often awkward.
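
For instance, pulling out a single named value typically means exploding the array first. A sketch (hypothetical table events):

SELECT f.type, f.value
FROM events
LATERAL VIEW explode(bespoke) t AS f
WHERE f.name = 'lucky_number';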

Option I -- ?
Any of Options B-G can be recast in the ARRAY style by moving the MAP key into a name field within each ARRAY item, e.g.

{
    lucky_number: 1,
    message: 'Hello, world!'
}

becomes

[
    { name: 'lucky_number', value: 1 },
    { name: 'message', value: 'Hello, world!' }
]
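
For Option B recast this way, the Hive type spec would be something like:

bespoke ARRAY<STRUCT<name: string, value: string>>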

Event Timeline

My preference would be for one of options B, D, or G, in order of increasing flexibility. My sense is that Option G would be able to accommodate the vast majority of data fields that people will want to carry, namely:

  • basic types (integer, float, string, boolean)
  • arrays of mixed basic types
  • objects mapping strings to basic types

These options have the drawback of not being collision-resistant. But it seems to me that if we solve that in other ways, the flexibility and ease of querying would be worth it.

Nice write up! :)

FYI, I don't think supporting UNIONTYPES is going to be possible. It isn't just Hive that has to work with this data.
Any typed system would need special code written to know how to deal with union types. https://wikitech.wikimedia.org/wiki/Event_Platform/Schemas/Guidelines#No_union_types_/_No_null_values

FYI, I don't think supporting UNIONTYPES is going to be possible. It isn't just Hive that has to work with this data.

By "things that aren't Hive", do we mean things upstream of Hive? Upstream, the JSONSchema would not need to be a union type, only when it was inserted into the table. Or do we mean things downstream of Hive?

@Ottomata: I think you accidentally double-posted :)

FYI, I don't think supporting UNIONTYPES is going to be possible

If I remember right, the idea is that this is for the tables derived/refined from the raw data. The event schemas will not have union types.

I am curious if Spark and Presto can work with such data, though.

I think it would be really useful to have a demo table with 2-3 rows and columns bespoke_a, …, bespoke_h, to be able to see what these formats are like to work with in HiveQL, SparkSQL (via wmfdata's PySpark wrapper), and Presto (via Superset's SQL Lab and wmfdata's Presto module).
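
A rough sketch of what that demo table's DDL could look like (all names hypothetical, remaining columns elided):

CREATE TABLE bespoke_demo (
  bespoke_a STRING,
  bespoke_b MAP<STRING, STRING>,
  bespoke_c MAP<STRING, STRUCT<type: STRING, value: STRING>>
  -- ...one column per remaining option, through bespoke_h
);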

Ah hm, do you mean you are going to read from the Hive event tables and then insert into a new Hive table that you create with UNIONTYPE? I think that will work (at least until you try to export that data somewhere else like Druid or Cassandra).

But, if you mean you want data with variable types in the JSON event data to be inserted into Hive...it could be possible but it would mean some complicated type inference code and schema evolution code. Any ingestion automation code that touches the data would have to know how to do it. E.g. this, this, this, etc.

Action is still required on this ticket regardless of the outcome of testing.

@jlinehan @Ottomata @DAbad as I remember from our discussion on 1 September, we decided on Option B. Does that still ring true?
If so, I think we can close this and open followup tasks as needed.

DAbad moved this task from Done to To be discussed on the Metrics Platform Backlog board.

Since we are no longer releasing the MVP, we have more time to consider this and determine the best path forward. We are going to leave this open for now.

A brief update:

The current iteration of the proposed Metrics Platform Event schema implements option C.

My preference would be for one of options B, D, or G, in order of increasing flexibility.

Regrettably, D through G have to be discarded because of T284147#7129129.

H and I were also discarded because dealing with arrays is, as noted in the description, awkward. It's not just awkward for analysts, mind you! In the case of a cross-platform instrument, the instrumentor would have to maintain the order of the custom data across implementations.

B and C are superior to A insofar as they enforce a predictable structure on the bespoke data. A and C are superior to B insofar as they encode information about the type of the original value, thereby minimising the amount of coordination between analyst and instrumentor.

Finally, as noted in the description, C requires that the analyst cast to the desired type at query time. However, this casting will be done as part of the refinement of the raw data that @mpopov mentioned in T284147#7129142.
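
To make that concrete, a hedged sketch of refine-time casting for Option C, so that analysts end up querying typed columns directly (table and column names hypothetical):

INSERT INTO TABLE refined_events
SELECT
  CAST(bespoke['lucky_number'].value AS INT) AS lucky_number,
  bespoke['message'].value                   AS message
FROM raw_events;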

Removing inactive assignee from this open task. (Please update assignees on open tasks after offboarding. Thanks.)