HiveExtensions.convertToSchema does not properly convert arrays of structs
Open, High, Public

Assigned To

None

Authored By

	Ottomata
	Aug 7 2020, 8:26 PM

Description

StructTypes need to be recursively merged and reordered via SQL. This is done for StructType fields, but it is not done for ArrayTypes with elementType == StructType.

Since no checking is done for ArrayType, the SQL for an array of struct field will look like:

CAST(d AS ARRAY<STRUCT<`db_string`: STRING, `da_long`: BIGINT>>) AS d

This will not work if the source schema has the struct fields in a different order, e.g. da_long, db_string.
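
A minimal Python sketch of why the field order matters, assuming (as the symptoms reported in this task suggest) that a struct-to-struct CAST pairs fields by position rather than by name. The helper names are hypothetical and exist only for this illustration; the sketch ignores the type conversion a real CAST would also apply.

```python
from typing import Any, Dict, List


def cast_struct_by_position(row: Dict[str, Any], target_fields: List[str]) -> Dict[str, Any]:
    """Mimic a positional struct cast: the i-th source value gets the i-th target name."""
    values = list(row.values())  # dicts keep insertion order, i.e. the source field order
    return dict(zip(target_fields, values))


def select_struct_by_name(row: Dict[str, Any], target_fields: List[str]) -> Dict[str, Any]:
    """Pick each target field from the source by name, so the source order is irrelevant."""
    return {name: row.get(name) for name in target_fields}


# Source struct has its fields in the order da_long, db_string.
source_row = {"da_long": 42, "db_string": "hello"}
# Target (Hive) schema expects db_string first, then da_long.
target_fields = ["db_string", "da_long"]

by_position = cast_struct_by_position(source_row, target_fields)
by_name = select_struct_by_name(source_row, target_fields)

print(by_position)  # values end up under the wrong field names
print(by_name)      # values stay with their own field names
```

The by-name variant is what the StructType handling already does through NAMED_STRUCT; the array case needs the same treatment applied to each element.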

This is causing bad refined data for mediawiki_cirrussearch_request, as reported in https://phabricator.wikimedia.org/P12200.

We need to make convertToSchema smarter about converting an array of structs.

Details

Subject	Repo	Branch	Lines +/-
Fix convertToSchema to work with array of structs	analytics/refinery/source	master	+98 -39
Refine - bump version to 0.0.132, but default to not merging Hive schemas	operations/puppet	production	+10 -8
Revert "Bump refine job refinery version to 0.0.132 to fix $schema field bug"	operations/puppet	production	+1 -1


Related Objects

Status	Assigned	Task
Open	None	T291464 Upgrade analytics-hadoop to Spark 3 + scala 2.12
Duplicate	None	T291465 Analytics-test-hadoop Spark3 package upgrade
Duplicate	None	T291466 Analytics-hadoop Spark3 package upgrade (production)
Resolved	JAllemandou	T306955 Spark3 migration - Currently existing airflow jobs
Open	None	T291386 Upgrade Refinery Jobs to Spark 3
Open	None	T255818 Refine drops $schema field values
Open	None	T259924 HiveExtensions.convertToSchema does not properly convert arrays of structs
Open	None	T366487 Event Platform schemas should not support type changes to structs as array element or map value types

Event Timeline

Ottomata created this task. Aug 7 2020, 8:26 PM

Restricted Application added a subscriber: Aklapper. Aug 7 2020, 8:26 PM

Change 619034 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] [WIP] Fix for convertToSchema with array of structs

https://gerrit.wikimedia.org/r/619034

gerritbot added a project: Patch-For-Review. Aug 7 2020, 8:28 PM

Ottomata updated the task description. Aug 7 2020, 8:29 PM

Attempt at https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/619034, but it doesn't quite work.

That changes the SQL to:

ARRAY(NAMED_STRUCT('db_string', d.db_string, 'da_long', d.da_long)) AS d

Which seems correct to me, but I am getting:

cannot resolve 'CAST(t_2c1aea2d83d849f5a74834554454c39d.`d` AS BIGINT)' due to data type mismatch: cannot cast array<struct<da_long:bigint,db_string:string>> to bigint; line 1 pos 15;
'Project [b#5L AS b#13L, cast(d#7 as bigint) AS d#14, named_struct(aa, cast(a#4.aa as array<int>), ab, cast(a#4.ab as map<string,string>), ad, cast(null as bigint), ac, a#4.ac, ae, array(named_struct(aea, cast(a#4.ae.aea as string))), af, cast(a#4.af as map<struct<afa:bigint>,struct<afb:bigint,afc:string>>)) AS a#15, c#6 AS c#16, array(named_struct(db_string, d#7.db_string, da_long, d#7.da_long)) AS d#17]

which I don't yet understand. Why does it think I want to cast d to a BIGINT? I see in the project: array(named_struct(db_string, d#7.db_string, da_long, d#7.da_long)) AS d#17, so I dunno.

This was noticed by @EBernhardson this week, after I merged the fix for T255818: Refine drops $schema field values on Monday. I'm no longer merging (and properly reordering?) the struct fields at read time with the Hive schema, so the bug manifests itself now that I'm calling convertToSchema from a DataFrame with its struct fields out of order in the array.

I think I'm going to have to revert and backfill mediawiki_cirrussearch_request since Monday. I don't know if this is affecting other data.

Change 618825 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Revert "Bump refine job refinery version to 0.0.132 to fix $schema field bug"

https://gerrit.wikimedia.org/r/618825

Change 618825 merged by Ottomata:
[operations/puppet@production] Revert "Bump refine job refinery version to 0.0.132 to fix $schema field bug"

https://gerrit.wikimedia.org/r/618825

Launching backfill:

sudo -u analytics /usr/bin/spark2-submit \
--name refine_event_backfill_cirrussearch_request \
--class org.wikimedia.analytics.refinery.job.refine.Refine \
--files /etc/hive/conf/hive-site.xml,/etc/refinery/refine/refine_event.properties,/srv/deployment/analytics/refinery/artifacts/hive-jdbc-1.1.0-cdh5.10.0.jar,/srv/deployment/analytics/refinery/artifacts/hive-service-1.1.0-cdh5.10.0.jar --master yarn --deploy-mode cluster --queue production --driver-memory 8G --executor-memory 4G --conf spark.driver.extraClassPath=/usr/lib/hadoop-mapreduce/hadoop-mapreduce-client-common.jar:hive-jdbc-1.1.0-cdh5.10.0.jar:hive-service-1.1.0-cdh5.10.0.jar --conf spark.dynamicAllocation.maxExecutors=64  \
--principal analytics/an-launcher1002.eqiad.wmnet@WIKIMEDIA --keytab /etc/security/keytabs/analytics/analytics.keytab \
/srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.0.129.jar \
--config_file refine_event.properties --table_whitelist_regex=mediawiki_cirrussearch_request --since=2020-08-03T13:00:00 --until=2020-08-07T20:00:00 --ignore_done_flag=true --ignore_failure_flag=true

application_1596639839773_18212

I don't know if this is affecting other data.

Is the hits field the one affected? https://schema.wikimedia.org/repositories//primary/jsonschema/mediawiki/cirrussearch/request/0.0.1.yaml

hits:
   description: Final set of result pages returned for the CirrusSearch request
   type: array
   items:
     type: object
     additionalProperties: false
     properties:
       page_title:
         description: MediaWiki page title of the result
         type: string
       page_id:
         description: MediaWiki page id of the result. May be -1 for interwiki results
         type: integer
       index:
         description: ElasticSearch index this result came from
         type: string
       score:
         description: Score from ElasticSearch for this result
         type: number
       profile_name:
         description: The profile name for comp_suggest queries
         type: string

which I don't yet understand. Why does it think I want to cast d to a BIGINT?

Ah, because I had a field in the test already called d, doh.

Ok getting closer, but something is still not quite right with my fix. Calling it a day.

I'll check on the backfill on Monday.

Is the hits field the one affected?

In this case it is the elasticsearch_requests[].hits_returned vs elasticsearch_requests[].query (and also other fields out of order). The value of hits_returned (an int) is being stuck in the query field and cast to a string.

Ottomata mentioned this in T259944: NULL-values for useragent column in event.searchsatisfaction. Aug 8 2020, 2:25 AM

20/08/08 02:30:22 INFO Refine: Successfully refined 103 of 103 dataset partitions into table `event`.`mediawiki_cirrussearch_request` (total # refined records: 1129822940)

Looks better:

select elasticsearch_requests from mediawiki_cirrussearch_request where year=2020 and month=8 and day=3 and hour=17 limit 10\G
...
elasticsearch_requests | [{query=The plot against , query_type=comp_suggest, indices=[enwiki_titlesuggest], namespaces=[0], request_time_ms=27, search_time_ms=0, limit=null, hits_total=14, hits_returned=6, hits_offset=0, ...

mforns assigned this task to Ottomata. Aug 10 2020, 3:49 PM

mforns triaged this task as High priority.

mforns moved this task from Incoming to Smart Tools for Better Data on the Analytics board.

mforns added a project: Analytics-Kanban.

Ottomata moved this task from Next Up to In Progress on the Analytics-Kanban board. Aug 10 2020, 10:21 PM

AHHHH RATS. I don't think Refine can do this, at least not with reading the incoming data with the merged Hive schema. ARRAY doesn't do what I had hoped, of course (it just makes an array of something, it isn't a 'named_struct' type of function).

I've tried many many different ways of using SQL to select from one array of structs into another ordered by field name, but I have not succeeded.

I think we need to find another solution. I'll follow up with Joseph here when he gets back.

Can you use spark higher order functions, particularly transform(array<T>, function<T, U>): array<U> ? This effectively maps a function over the array, which could be a named struct?
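
A rough sketch of how that suggestion could look, in Python for illustration only. It builds a Spark SQL expression string that maps NAMED_STRUCT over each array element with TRANSFORM, so fields are selected by name in the target order. The schema encoding and the `build_expr` helper are hypothetical and are not part of refinery's HiveExtensions; the sketch also does not handle fields missing from the source, maps, or identifier quoting.

```python
from typing import Tuple, Union

# Hypothetical schema encoding used only in this sketch:
#   "BIGINT"                         -> primitive type name
#   ("array", element_type)          -> array
#   ("struct", [(name, type), ...])  -> struct with fields in target order
DataType = Union[str, Tuple[str, object]]


def build_expr(col: str, dtype: DataType, depth: int = 0) -> str:
    """Return a SQL expression converting `col` to `dtype`, selecting struct fields by name."""
    if isinstance(dtype, str):
        # Primitive: a plain cast is enough.
        return "CAST(" + col + " AS " + dtype + ")"
    kind, inner = dtype
    if kind == "struct":
        # Rebuild the struct field by field, in the target order.
        parts = []
        for name, field_type in inner:
            field_expr = build_expr(col + "." + name, field_type, depth)
            parts.append("'" + name + "', " + field_expr)
        return "NAMED_STRUCT(" + ", ".join(parts) + ")"
    if kind == "array":
        # Map the element conversion over the array; a per-depth lambda
        # variable keeps nested arrays from shadowing each other.
        var = "x" + str(depth)
        return "TRANSFORM(" + col + ", " + var + " -> " + build_expr(var, inner, depth + 1) + ")"
    raise ValueError("unsupported type: " + str(kind))


target = ("array", ("struct", [("db_string", "STRING"), ("da_long", "BIGINT")]))
sql = build_expr("d", target) + " AS d"
print(sql)
```

Unlike ARRAY(NAMED_STRUCT(...)), which wraps a single struct in a new array, TRANSFORM applies the struct rebuild to every element of the existing array.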

Change 619496 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Refine - bump version to 0.0.132, but default to not merging Hive schemas

https://gerrit.wikimedia.org/r/619496

Change 619496 merged by Ottomata:
[operations/puppet@production] Refine - bump version to 0.0.132, but default to not merging Hive schemas

https://gerrit.wikimedia.org/r/619496

In T259924#6376540, @EBernhardson wrote:

Can you use spark higher order functions, particularly transform(array<T>, function<T, U>): array<U> ? This effectively maps a function over the array, which could be a named struct?

Good call @EBernhardson :)

JAllemandou moved this task from In Progress to In Code Review on the Analytics-Kanban board. Aug 26 2020, 11:35 AM

mforns moved this task from In Code Review to Paused on the Analytics-Kanban board. Sep 14 2020, 4:02 PM

fdans removed a project: Analytics-Kanban. Oct 26 2020, 4:45 PM

From what I gathered in the code reviews, this is easier in Spark 3 and we're waiting for that, correct?

I can't 100% recall but I believe that is correct.

odimitrijevic added a project: Data-Engineering. Jan 6 2022, 4:17 AM

odimitrijevic moved this task from Incoming (new tickets) to Analyze on the Data-Engineering board. Jan 6 2022, 5:07 AM

odimitrijevic removed a project: Analytics. Jan 12 2022, 12:35 AM

Ottomata added a parent task: T291386: Upgrade Refinery Jobs to Spark 3. May 11 2022, 5:43 PM

Ottomata mentioned this in T255818: Refine drops $schema field values. Jun 9 2022, 2:31 PM

Aklapper edited projects, added Patch-Needs-Improvement; removed Patch-For-Review. Feb 21 2023, 10:14 PM