
Support NULL values in RowData in eventutilities
Open, Needs Triage, Public

Description

When using a Row, we can omit fields that we don't need by just not setting them in the first place.
Given an example schema:

test:
  type: string
test_int:
  type: int

You can create a row:

private final EventRowTypeInfo typeInfo...

Row r = typeInfo.createEmptyRow();
r.setField("test", "test_string");

When serialized, this results in:

{
  "test": "test_string"
}

Only when this event is ingested into Hive do the unset columns become NULL.

But when using RowData, which directly mirrors the SQL schema, unset fields default to NULL in the first place, so an insert into the Flink catalog

INSERT INTO `example.schema` (`test`) VALUES ('test_string');

Results in

{
  "test": "test_string",
  "test_int": null
}

Which then fails the JSON schema validation before it can be sunk.

We either need to support removing null object nodes when generating the JSON from RowData, or make it a rule that users of the Flink Catalog must provide values for all fields.
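
Removing null object nodes could look roughly like the sketch below. It is only an illustration under stated assumptions, not the eventutilities implementation: plain java.util.Map and java.util.List stand in for the JSON object tree, and the class name NullFieldPruner and its prune method are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: drops null-valued fields from a JSON-like tree.
// Maps stand in for JSON objects and Lists for JSON arrays.
public class NullFieldPruner {

    // Returns a copy of the object with every null-valued field removed,
    // recursing into nested objects and arrays.
    public static Map<String, Object> prune(Map<String, Object> object) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : object.entrySet()) {
            Object value = entry.getValue();
            if (value == null) {
                continue; // omit the field instead of emitting "field": null
            }
            result.put(entry.getKey(), pruneValue(value));
        }
        return result;
    }

    @SuppressWarnings("unchecked")
    private static Object pruneValue(Object value) {
        if (value instanceof Map) {
            return prune((Map<String, Object>) value);
        }
        if (value instanceof List) {
            List<Object> copy = new ArrayList<>();
            for (Object element : (List<Object>) value) {
                // Null array elements are kept: dropping them would shift positions.
                copy.add(element == null ? null : pruneValue(element));
            }
            return copy;
        }
        return value; // scalars are returned unchanged
    }
}
```

Applied to the example event above, prune would leave only the "test" field, which matches what the Row-based path already produces.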

This also calls into question how default values should be handled. Does someone omitting a field with a default value mean that they want it to have the default value, or that they want it to be NULL?

Event Timeline

We either need to support removing null object nodes when generating the JSON from RowData,

I think you are right. Right now, we have two options for where in event-utilities we might do this.

  1. As an EventNormalization step. This could possibly be done in the Validation step, since that step has the JsonSchema and can check that the field's type is not a null type (Event Platform doesn't support null types, but we should probably still be careful).
  2. In the RowDataJsonConverters createRowConverter. Here, if the value is null, we just wouldn't set the field on the node.
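
The second option amounts to skipping null values at conversion time. The sketch below illustrates the idea only and does not use the real RowDataJsonConverters API: field names and values are passed as parallel arrays, a Map stands in for the JSON object node, and NullSkippingRowConverter is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a row-to-JSON converter; not the Flink or eventutilities API.
public class NullSkippingRowConverter {

    // Builds a JSON-like object from a row, leaving out every field whose value is null.
    public static Map<String, Object> convert(String[] fieldNames, Object[] fieldValues) {
        if (fieldNames.length != fieldValues.length) {
            throw new IllegalArgumentException("field names and values must have the same length");
        }
        Map<String, Object> node = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            if (fieldValues[i] != null) {
                node.put(fieldNames[i], fieldValues[i]); // only non-null fields are set on the node
            }
        }
        return node;
    }
}
```

For the example row (test = "test_string", test_int = NULL), the resulting object contains only the "test" field.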

I prefer option 1. because then we can avoid yet another modification to an upstream copied class.

This also calls into question how default values should be handled. Does someone omitting a field with a default value mean that they want it to have the default value, or that they want it to be NULL?

Hm, a really great question. EventGate's choice is to set the default value. This would be ambiguous if Event Platform supported null types. It doesn't, so I think our convention should be the same: if a field has a default value, a library may set it to that value if it is omitted (or set to null) by the producing code. I say 'may' because it depends on whether the producing library (EventGate, event-utilities, etc.) supports setting JSONSchema defaults. EventGate does. Right now, event-utilities doesn't.

I think we should probably make event-utilities set default values. It doesn't look like the JsonSchemaValidator we use supports this, so either we need to use another validator, or implement setting the defaults ourselves.
If we are going to implement removing the null fields in our Validator EventNormalization step, setting defaults in that same step won't be difficult.

So, in summary, I think we should:

  • Modify Validation EventNormalization Step to recurse through the JsonSchema and:
    • Set default value if the schema has one, and the field is null or omitted.
    • Remove fields that are null.
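
As a rough sketch of the step summarized above, assuming the schema and event are available as nested java.util.Map structures (a stand-in for the real JsonSchema and JSON node types). The class name DefaultsAndNullsNormalizer is hypothetical, and only the "properties" and "default" schema keywords are considered:

```java
import java.util.Map;

// Hypothetical normalization step; not the eventutilities implementation.
public class DefaultsAndNullsNormalizer {

    // Walks the schema's "properties" and mutates the event in place:
    // a null or omitted field gets the schema default if one exists,
    // otherwise a null field is removed. Nested objects are handled recursively.
    @SuppressWarnings("unchecked")
    public static void normalize(Map<String, Object> schema, Map<String, Object> event) {
        Object properties = schema.get("properties");
        if (!(properties instanceof Map)) {
            return; // nothing to recurse into
        }
        for (Map.Entry<String, Object> property : ((Map<String, Object>) properties).entrySet()) {
            if (!(property.getValue() instanceof Map)) {
                continue; // skip anything that is not an object-form field schema
            }
            String name = property.getKey();
            Map<String, Object> fieldSchema = (Map<String, Object>) property.getValue();
            Object value = event.get(name); // null when the field is omitted or explicitly null

            if (value == null) {
                Object defaultValue = fieldSchema.get("default");
                if (defaultValue != null) {
                    event.put(name, defaultValue); // set the default value
                } else {
                    event.remove(name); // no default: drop the null field
                }
            } else if (value instanceof Map) {
                normalize(fieldSchema, (Map<String, Object>) value);
            }
        }
    }
}
```

Because this sketch is driven by the schema, a null field that the schema does not declare is left untouched; whether such fields should also be removed is a separate decision.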