
Refine drops $schema field values
Open, High, Public

Description

In order to do our best at handling backwards-incompatible data in EventLogging, when reading raw input JSON data during Refine, we do so with a schema made by merging the Hive table's schema with the event JSONSchema. We assume that the Hive table has all potential fields it has ever seen, and that the JSONSchema might have new ones.

When we merge the schemas, we also normalize them to avoid casing (and other) differences between SQL and non-SQL systems. Our field normalization also converts SQL-incompatible characters in field names to '_'. So, $schema becomes _schema. When this merged schema is used to read the JSON data, it doesn't have a $schema field, and as such that field in the JSON data is lost.
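Roughly, the normalization does something like this (a hedged sketch only, not the actual HiveExtensions implementation):

// Hedged sketch only: lowercase the field name and replace SQL-incompatible
// characters with '_', so "$schema" becomes "_schema".
def normalizeFieldName(name: String): String =
  name.toLowerCase.replaceAll("[^a-z0-9_]", "_")

assert(normalizeFieldName("$schema") == "_schema")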

In MEP, we don't really need to merge the JSONSchema with the Hive table schema to read the JSON data anymore, so we should probably stop doing that eventually. However, we do still need to merge and normalize the event schema with the Hive table schema in order to successfully write into it. If we stopped merging before reading right now, we'd end up with a DataFrame that has two _schema fields in it: one all NULL from Hive, and one with real schema URIs from the raw data.

We definitely need to avoid normalizing the schema before reading. Once we do that, we'll hit the double _schema field problem described above; @joal and I came up with a couple of possible solutions:

  • Easy fix: Drop all _schema columns from all Hive tables. The next time data is refined, the actual $schema will be read in with real values, and then the field name will be normalized to _schema before writing.
  • Correct fix: When merging, keep track of what fields get name changes due to normalization, and drop any columns from the Hive side DataFrame that will be normalized. This effectively chooses the input side normalized column over the Hive side one.

The correct fix sounds great, but will be difficult to implement for nested struct fields. We have to somehow recurse into a DataFrame and rebuild it using a new schema with a different number of fields, or figure out how to normalize by recursively renaming columns in a DataFrame, not just a StructType schema like we do now.
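For the recursive-renaming approach, something along these lines might work for plain struct fields (untested sketch; it does not handle arrays of structs or map types):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.struct
import org.apache.spark.sql.types.{DataType, StructType}

// Untested sketch: rebuild each column, renaming struct fields at every level
// using a name normalization function like the one sketched above.
def normalizeFieldName(name: String): String =
  name.toLowerCase.replaceAll("[^a-z0-9_]", "_")

def normalizeColumn(column: Column, dataType: DataType): Column = dataType match {
  case st: StructType =>
    struct(st.fields.map { f =>
      normalizeColumn(column.getField(f.name), f.dataType).as(normalizeFieldName(f.name))
    }: _*)
  case _ => column
}

def normalizeColumnNames(df: DataFrame): DataFrame =
  df.select(df.schema.fields.map { f =>
    normalizeColumn(df(f.name), f.dataType).as(normalizeFieldName(f.name))
  }: _*)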

Event Timeline

Milimetric moved this task from Incoming to Event Platform on the Analytics board.

@JAllemandou I thought I could get dropping struct columns to work like:

val newMetaCol = struct("meta", df0.select("meta.*").drop("topic").columns:_*)
df0.withColumn("meta", newMetaCol)

But it doesn't work. I think maybe withColumn, which lets you work with Columns of the same DataFrame, doesn't work with a Column created by the struct function.

I wrote ^ back on June 18 but did not submit the comment.
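Something like this might work instead, since struct() with bare field names resolves them against the DataFrame's top-level columns rather than against the fields inside meta (untested sketch, using the same df0 as above):

import org.apache.spark.sql.functions.{col, struct}

// Untested sketch: rebuild the "meta" struct from fully qualified field references,
// leaving out the "topic" field.
val keptFields = df0.select("meta.*").drop("topic").columns
val newMetaCol = struct(keptFields.map(f => col(s"meta.$f").as(f)): _*)
val df1 = df0.withColumn("meta", newMetaCol)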

Change 613251 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Refine - Don't merge Hive schema by default when reading input data

https://gerrit.wikimedia.org/r/613251

Actually, I think https://gerrit.wikimedia.org/r/613251 should just work, at least for _schema.

For non-EventLogging-metawiki schemas, we can be sure that each new schema version is backwards compatible, and we are always using the latest schema anyway. So, we can just avoid merging the loaded latest event schema with the Hive schema when we read the data. This will read the JSON data with $schema present, and later DataFrameToHive will normalize and convert it to _schema, which will be inserted properly into the Hive table. Old data will of course still have NULL _schema fields, but new data will have $schema refined all the way through properly.
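Roughly, the read side then looks like this (illustrative sketch only; the schema and input path below are made up, not real refinery code):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative sketch only: reading with just the (un-normalized) event schema,
// without merging in the Hive table schema, keeps the $schema field in the DataFrame.
val spark = SparkSession.builder.getOrCreate()
val eventSchema = StructType(Seq(
  StructField("$schema", StringType),
  StructField("meta", StructType(Seq(StructField("domain", StringType))))
))
val rawDf = spark.read.schema(eventSchema).json("/path/to/raw/event/data")  // hypothetical path
// DataFrameToHive later normalizes field names ($schema -> _schema) before writing to Hive.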

We do still need to merge with the Hive schema for the refine_eventlogging_analytics job that reads from metawiki. Even though we tell people there not to make backwards incompatible changes, there is nothing stopping them from doing it. We don't want a repeat of T226219.

Once T238230: Decommission EventLogging backend components by migrating to MEP is fully done, we can remove support for merging during read from Refine altogether.

Am I missing something @JAllemandou? I thought we talked about it being much more complicated than this, but I can't quite remember why.

Knowing that we're moving EL to MEP, I think we're ok with the current situation :)

Change 613251 merged by Ottomata:
[analytics/refinery/source@master] Refine - Don't merge Hive schema by default when reading input data

https://gerrit.wikimedia.org/r/613251

Change 615217 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Don't use merged Hive + event schema when reading raw event data

https://gerrit.wikimedia.org/r/615217

Change 615217 merged by Ottomata:
[operations/puppet@production] Don't use merged Hive + event schema when reading raw event data

https://gerrit.wikimedia.org/r/615217

Mentioned in SAL (#wikimedia-analytics) [2020-07-21T13:36:19Z] <ottomata> Refine no longer merges with Hive table schema when reading (except for refine_eventlogging_analytics job) - T255818

Change 615231 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] Refine - Quote SQL columns used in selectExpr in TransformFunctions

https://gerrit.wikimedia.org/r/615231

Mentioned in SAL (#wikimedia-analytics) [2020-07-21T14:58:28Z] <ottomata> Refine - reverted change to not merge hive schema + event schema before reading - T255818

Change 615231 merged by Ottomata:
[analytics/refinery/source@master] Refine - Quote SQL columns used in selectExpr in TransformFunctions

https://gerrit.wikimedia.org/r/615231

Change 618141 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Bump refine job refinery version to 0.0.132 to fix $schema field bug

https://gerrit.wikimedia.org/r/618141

Change 618141 merged by Ottomata:
[operations/puppet@production] Bump refine job refinery version to 0.0.132 to fix $schema field bug

https://gerrit.wikimedia.org/r/618141

Ok! Seems to be working now, phew!

Change 619496 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Refine - bump version to 0.0.132, but default to not merging Hive schemas

https://gerrit.wikimedia.org/r/619496

Change 619496 merged by Ottomata:
[operations/puppet@production] Refine - bump version to 0.0.132, but default to not merging Hive schemas

https://gerrit.wikimedia.org/r/619496

Mentioned in SAL (#wikimedia-analytics) [2020-08-11T17:36:04Z] <ottomata> refine with refinery-source 0.0.132 and merge_with_hive_schema_before_read=true - T255818

We will tackle this problem once we move to Spark 3, to make sure we can fix structs and map types.

I just re-read some context on this ticket, and I wanted to write it down so it doesn't get lost.

This change is implemented, but not enabled. Enabling it was reverted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/618825/, as it caused issues when schemas had complex nested types as described in T259924: HiveExtensions.convertToSchema does not properly convert arrays of structs.

Joseph has a fix that will work in https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/619034, but to work with Map types, it needs Spark 3.
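(For reference: Spark 3 exposes transform_keys / transform_values as DataFrame functions, which is presumably the kind of API needed to rewrite MapType columns. Illustrative sketch only, not the actual patch; df and some_map_field are hypothetical.)

import org.apache.spark.sql.functions.{col, transform_values, upper}

// Illustrative only: rewrite the values of a hypothetical Map[String, String] column.
// Recursively converting map value types during convertToSchema needs this kind of function.
val converted = df.withColumn(
  "some_map_field",
  transform_values(col("some_map_field"), (k, v) => upper(v))
)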

So, we are waiting for Spark 3 to merge that change, and then we can re-enable this, and then finally $schema will be populated correctly.

@Ottomata: Hi, all related patches in Gerrit have been merged. Can this task be resolved (via Add Action...Change Status in the dropdown menu), or is there more to do in this task? Asking as you are set as task assignee. Thanks in advance!

Not yet, still waiting on the Spark 3 upgrade. See the last comment.

@Ottomata: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome! :)
If this task has been resolved in the meantime, or should not be worked on by anybody ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator. Thanks!