To better handle backwards-incompatible data in EventLogging, Refine reads raw input JSON data using a schema made by merging the Hive table schema with the event JSONSchema. We assume that the Hive table has all fields it has ever seen, and that the JSONSchema might have new ones.
When we merge the schemas, we also normalize them to avoid casing (and other) differences between SQL and non-SQL systems. Our field normalization also converts SQL-incompatible characters in field names to '_', so $schema becomes _schema. When this merged schema is used to read the JSON data, it doesn't have a $schema field, and so that field in the JSON data is lost.
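As a rough illustration of the renaming described above (a hypothetical sketch, not the actual Refine implementation, which also handles other normalization details):

```python
def normalize(name):
    # Hypothetical sketch: lowercase and replace SQL-incompatible
    # characters with '_'. The real Refine normalization may differ.
    return "".join(c if c.isalnum() or c == "_" else "_" for c in name).lower()

print(normalize("$schema"))  # -> _schema
```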
In MEP, we no longer really need to merge the JSONSchema with the Hive table schema in order to read the JSON data, so we should probably stop doing that eventually. However, we do need to merge and normalize the event schema with the Hive table schema in order to successfully write into it. If we did this now, we'd end up with a DataFrame that has two _schema fields in it: one all NULL from Hive, and one with real schema URIs from the raw data.
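A minimal sketch of how the collision happens, using plain Python lists of field names (hypothetical normalize function, not the real Refine code):

```python
def normalize(name):
    # hypothetical: replace SQL-incompatible characters with '_'
    return "".join(c if c.isalnum() or c == "_" else "_" for c in name)

hive_fields = ["_schema", "meta"]   # table schema, already normalized; _schema is all NULL
raw_fields  = ["$schema", "meta"]   # read without normalization, keeps real $schema values

# merge by exact name: "$schema" != "_schema", so both survive
merged = list(dict.fromkeys(hive_fields + raw_fields))

# normalizing the merged schema before the write collapses the names
written = [normalize(f) for f in merged]
print(written)  # -> ['_schema', 'meta', '_schema']: two _schema columns
```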
We definitely need to stop normalizing the schema before reading. Once we do that, @joal and I came up with two possible solutions to the double _schema field problem.
- Easy fix: Drop all _schema columns from all Hive tables. The next time data is refined, the actual $schema will be read in with real values, and then the field name will be normalized to _schema before writing.
- Correct fix: When merging, keep track of what fields get name changes due to normalization, and drop any columns from the Hive side DataFrame that will be normalized. This effectively chooses the input side normalized column over the Hive side one.
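The correct fix could be sketched like this on flat lists of field names (hypothetical helper names; real Spark schemas are nested StructTypes, which is where it gets hard):

```python
def normalize(name):
    # hypothetical: replace SQL-incompatible characters with '_'
    return "".join(c if c.isalnum() or c == "_" else "_" for c in name)

def merge_preferring_input(hive_fields, raw_fields):
    # Names that normalization will change on the input side...
    renamed = {normalize(f) for f in raw_fields if normalize(f) != f}
    # ...and the Hive-side columns they would collide with get dropped,
    # so the input-side normalized column wins.
    kept_hive = [f for f in hive_fields if f not in renamed]
    normalized_raw = [normalize(f) for f in raw_fields]
    return kept_hive + [f for f in normalized_raw if f not in kept_hive]

print(merge_preferring_input(["_schema", "meta", "legacy_field"],
                             ["$schema", "meta"]))
# -> ['meta', 'legacy_field', '_schema']: only one _schema, from the input side
```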
The correct fix sounds great, but will be difficult to implement for nested struct fields. We'd have to somehow recurse into a DataFrame and rebuild it using a new schema with a different number of fields, or figure out how to normalize by recursively renaming columns in a DataFrame, not just in a StructType schema as we do now.
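The recursion itself is straightforward when the schema is plain data; the hard part described above is doing the equivalent on an actual Spark DataFrame. For reference, a sketch of the recursive rename on a dict-based schema (a hypothetical representation, not Spark's StructType):

```python
def normalize(name):
    # hypothetical: replace SQL-incompatible characters with '_'
    return "".join(c if c.isalnum() or c == "_" else "_" for c in name)

def normalize_schema(schema):
    # Recursively rename fields; struct types are nested dicts here.
    return {
        normalize(name): normalize_schema(typ) if isinstance(typ, dict) else typ
        for name, typ in schema.items()
    }

schema = {"$schema": "string", "meta": {"$id": "string", "dt": "timestamp"}}
print(normalize_schema(schema))
# -> {'_schema': 'string', 'meta': {'_id': 'string', 'dt': 'timestamp'}}
```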