I want to present some of the options we have for storage of a bespoke data structure in a Hive table for discussion.
Key goals for us are
- Simple to query
- Flexibility for upstream instruments to add/remove/change fields without the need for a database migration
- in particular, this means that the storage format should be fixed, and the type of the data value can be changed without changing the column type
- as much type-safety and performance as we can have without compromising the above
After working through alternatives, here are the options I was able to create. In these examples, the bespoke structure, as it appears in JSON, will hold multiple pieces of non-standard data under a single field, called bespoke. Each example below is labeled with the Hive type spec, and is followed by how the structure would look on the wire in JSON.
Option A
bespoke STRING
bespoke: "{ lucky_number: 1, message: 'hello, world!' }"
The bespoke field's value is a string, in particular a JSON string which can be parsed into proper JSON at query-time using json_object() or json_tuple().
Option B
bespoke MAP<string, string>
bespoke: { lucky_number: '1', message: 'hello, world!' }
The bespoke field's value is an object with string keys and string values. Even types which are not strings must be converted to strings before being sent.
Option C
bespoke MAP<string, STRUCT<type: string, value: string>>
bespoke: { lucky_number: { type: 'integer', value: '1' }, message: { type: 'string', value: 'hello, world!' } }
Rather than force all values to be strings, we allow query-time casting of values by using a type hint. Not ideal since we don't want to be casting at query time if we can help it.
Option D
bespoke MAP<string, UNIONTYPE<integer, float, string, boolean>>
bespoke: { lucky_number: 1, message: 'hello, world!' }
This accommodates the flexible use of JSON with primitive types, allows the types to change, and is type-safe. It does not support complex types except if encoded as strings.
Option E
bespoke MAP<string, UNIONTYPE<integer, float, string, boolean, ARRAY<integer>, ARRAY<float>, ARRAY<string>, ARRAY<boolean>>
bespoke: { lucky_number: 1, message: 'hello, world!', interval:[0,1] }
This accommodates everything in Option D, and also allows arrays of fixed primitive types.
Option F
bespoke MAP<string, UNIONTYPE<integer, float, string, boolean, ARRAY<UNIONTYPE<integer, float, string, boolean>>>
bespoke: { lucky_number: 1, message: 'hello, world!', interval:[0, 'wow'] }
This accommodates everything in Option E, but allows the arrays to be of mixed primitive types.
Option G
bespoke MAP<string, UNIONTYPE<integer, float, string, boolean, ARRAY<UNIONTYPE<integer, float, string, boolean>>, MAP<string, UNIONTYPE<integer, float, string, boolean, ARRAY<UNIONTYPE<integer, float, string, boolean>>>>
bespoke: { lucky_number: 1, message: 'hello, world!', interval:[0, 'wow'], notes: { color: 'blue', length: 50 } }
This accommodates everything in Option F, but also allows for objects mapping strings to primitive types (i.e. only one level of nesting).
Option H
bespoke ARRAY<STRUCT<name: string, type: string, value:string>>
bespoke: [ { name: 'lucky_number', type: 'integer', value: '1' }, { name: 'message', type: 'string', value: 'Hello, world!' } ]
This approach uses ARRAY instead of MAP as the outermost complex type. This has the benefit of avoiding collisions, but makes queries more difficult since dealing with arrays is often awkward.
Option I -- ?
Any of the Options B-G can be accommodated in the ARRAY style by making the MAP key an element of the MAP within the ARRAY item, e.g.
{ lucky_number: 1, message: 'Hello, world!' }
becomes
[ { name: lucky_number, value: 1 }, { name: message, value: 'Hello, world!' } ]