Page MenuHomePhabricator

Rename column on old hive data for a few tables
Closed, DeclinedPublic

Description

I'm in the process of unifying a column name between different EventLogging schemas, to make for less confusing joins. However, when querying data before the rename, the old column name is the one that has data, on more recent rows, it's the new column. This is less than ideal.

Is there a simple way to rename the column in hive rows that were recorded prior to the column name change?

The tables that will need this are NavigationTiming (very large) and QuickSurveysResponses/QuickSurveyInitiation (small)

Event Timeline

However, when querying data before the rename, the old column name is the one that has data, on more recent rows, it's the new column. This is less than ideal.

Right, the renaming of the schema is seen by the persistence storage as "removal" of 1 column and "addition" of 1 column. Our recommendation is to keep changes in schemas backwards compatible so they can be persisted easily, while json schema gives you a lot of freedom you are limited by what you can persist on hive. A removal of a field is not backwards compatible.

Is there a simple way to rename the column in hive rows that were recorded prior to the column name change?

While there is an alter table syntax is not trivial to do this as some partitions are receiving live data, will consult with team.

We talked this over and are hesitant to do this kind of work, because we want to encourage backwards-compatible changes going forward. Is it too much of a pain for you to build a view/something similar on top of this that you can query instead? Like this: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView

I'll give views a shot, thanks. I wasn't aware those were available.