Refining is failing to refine CentralNoticeImpression events
Closed, Resolved (Public)

Description

CentralNotice impression events are flowing and look OK in the raw data, but they are not being refined correctly.
See schema:
https://meta.wikimedia.org/wiki/Schema:CentralNoticeImpression

REFINED DATA:

hive (event)> select * from CentralNoticeImpression where year=2019 and month=12 and day=15 and hour=10 limit 10;
OK
ip	useragent	uuid	seqid	dt	wiki	webhost	schema	revision	topic	recvfrom	event	geocoded_data	year	month	day	hour
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-pig-bundle-1.5.0-cdh5.16.1.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-hadoop-bundle-1.5.0-cdh5.16.1.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/parquet-format-2.1.0-cdh5.16.1.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-exec-1.1.0-cdh5.16.1.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hive/lib/hive-jdbc-1.1.0-cdh5.16.1-standalone.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [shaded.parquet.org.slf4j.helpers.NOPLoggerFactory]
NULL	{"browser_family":null,"browser_major":null,"browser_minor":null,"device_family":null,"is_bot":null,"is_mediawiki":null,"os_family":null,"os_major":null,"os_minor":null,"wmf_app_version":null}	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	NULL	{"anonymous":null,"project":null,"db":null,"uselang":null,"device":null,"country":null,"region":null,"debug":null,"randomcampaign":null,"randombanner":null,"recordImpressionSampleRate":null,"impressionEventSampleRate":null,"status":null,"statusCode":null,"campaignCategory":null,"campaign":null,"banner":null,"campaignCategoryUsesLegacy":null,"bannerCategory":null,"bucket":null,"bannersNotGuaranteedToDisplay":null,"bannerCanceledReason":null,"result":null,"reason":null,"requestedBanner":null,"alterFunctionMissing":null,"banner_count":null,"errorMsg":null,"testIdentifiers":null,"debugInfo":null,"campaignStatuses":null}	{"city":"Unknown","latitude":"-1.0","timezone":"Unknown","country":"Unknown","longitude":"-1.0","continent":"Unknown","country_code":"--","subdivision":"Unknown","postal_code":"Unknown"}	2019	12	15	10
Time taken: 0.863 seconds, Fetched: 1 row(s)

RAW DATA (for that same hour)

1576404306000 {"dt": "2019-12-15T10:05:06Z", "event": {"alterFunctionMissing": true, "anonymous": true, "banner": "B19_WMDE_Mobile_04_var", "bannerCategory": "fundraising", "bannersNotGuaranteedToDisplay": true, "bucket": 0, "campaign": "C19_WMDE_Mobile_Test_04", "campaignCategory": "fundraising", "campaignCategoryUsesLegacy": true, "campaignStatuses": "[{\"statusCode\":\"6\",\"campaign\":\"C19_WMDE_Mobile_Test_04\",\"bannersCount\":2}]", "country": "DE", "db": "dewiki", "debug": false, "device": "android", "impressionEventSampleRate": 0.01, "project": "wikipedia", "randombanner": 0.5129977427490309, "randomcampaign": 0.2397016272995669, "recordImpressionSampleRate": 1, "region": "BY", "result": "show", "status": "banner_shown", "statusCode": "6", "uselang": "de"}, "ip": "some, "recvFrom": "cp3050.esams.wmnet", "revision": 19511351, "schema": "CentralNoticeImpression", "seqId": 39737114, "userAgent": {"browser_family": "Samsung Internet", "browser_major": "10", "browser_minor": "2", "device_family": "Samsung SM-G965F", "is_bot": false, "is_mediawiki": false, "os_family": "Android", "os_major": "9", "os_minor": null, "wmf_app_version": "-"}, "uuid": "563c86f7f0725c3fa831d097790b4432", "webHost": "de.m.wikipedia.org", "wiki": "dewiki"}

Event Timeline

Nuria renamed this task from "centralnotice events do not have data" to "Refining is failing to refine CentralNoticeImpression events". Feb 10 2020, 6:30 PM
Nuria assigned this task to Ottomata.
Nuria updated the task description.

pinging @DStrine so he knows this is going on.

@AndyRussG The issue can be traced to this change: https://meta.wikimedia.org/w/index.php?title=Schema%3ACentralNoticeImpression&type=revision&diff=19511351&oldid=19510146

which changed the type of a field, and that is a non-backwards-compatible change.
Changes such as these normally trigger an alarm when refining data on our end, but in this case they did not (we are looking into why). In any case, it is important to know that non-backwards-compatible changes to schemas are not supported.

@AndyRussG changed the type of the campaignStatuses field. He added the field at 15:40, 31 October 2019, and then changed its type 6 hours later. That was enough time for the Refine job to see the new field as it was first added and alter the Hive table to add the field as an array of strings. We don't support type changes, so this data is failing to import properly.

I'm not sure why we aren't getting an error about this though. It looks like Spark is somehow auto-converting from a string to an array of strings, but failing to import data later on. Am looking into that.
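To make the failure mode concrete, here is a minimal Python sketch of the kind of check that would flag this change: it compares the field types of two schema versions and reports fields whose type differs. The flat dict layout and the name `find_type_changes` are assumptions for illustration only; this is not the Refine job's code or the real schema format.

```python
def find_type_changes(old_fields, new_fields):
    """Return {field: (old_type, new_type)} for fields present in both
    versions whose declared type differs. Newly added fields are ignored,
    because adding a field is a backwards-compatible change."""
    changes = {}
    for name, old_type in old_fields.items():
        new_type = new_fields.get(name)
        if new_type is not None and new_type != old_type:
            changes[name] = (old_type, new_type)
    return changes


# Hypothetical, simplified view of the situation described above:
# the Hive table picked up the field as an array of strings,
# while the latest schema revision declares it as a string.
hive_table = {"campaign": "string", "campaignStatuses": "array<string>"}
latest_schema = {
    "campaign": "string",
    "campaignStatuses": "string",
    "banner": "string",
}

print(find_type_changes(hive_table, latest_schema))
# {'campaignStatuses': ('array<string>', 'string')}
```

Only `campaignStatuses` is reported; the added `banner` field is not, since a new field does not conflict with anything already in the table.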

Change 571365 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery/source@master] [WIP] Warn when merging incompatible types; FAILFAST when reading JSON data with a schema

https://gerrit.wikimedia.org/r/571365
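The patch title refers to FAILFAST, one of the parse modes of Spark's JSON reader (the default, PERMISSIVE, yields nulls for records it cannot parse against the schema instead of raising). The following is a toy Python model of that difference, not Spark's implementation: `read_record`, `MalformedRecord`, and the two-field schema are hypothetical, and nulling the whole record is a simplification. Whether this is exactly what produced the all-NULL row in the task description is an assumption.

```python
import json


class MalformedRecord(ValueError):
    """Raised in FAILFAST mode when a record does not fit the schema."""


def matches(value, type_name):
    """Check a parsed JSON value against a (tiny) set of type names."""
    if value is None:
        return True  # nulls are allowed for any type in this sketch
    if type_name == "array<string>":
        return isinstance(value, list) and all(isinstance(v, str) for v in value)
    if type_name == "string":
        return isinstance(value, str)
    raise ValueError(f"unsupported type in this sketch: {type_name}")


def read_record(line, schema, mode="PERMISSIVE"):
    """Parse one JSON line against schema ({field: type_name})."""
    record = json.loads(line)
    bad = [f for f, t in schema.items() if f in record and not matches(record[f], t)]
    if not bad:
        return {f: record.get(f) for f in schema}
    if mode == "FAILFAST":
        raise MalformedRecord(f"fields do not match schema: {bad}")
    # PERMISSIVE (simplified): keep going, but the record comes out as nulls.
    return {f: None for f in schema}


# The table expects an array of strings, but the event carries a string.
schema = {"status": "string", "campaignStatuses": "array<string>"}
line = json.dumps({"status": "banner_shown", "campaignStatuses": '[{"statusCode":"6"}]'})

print(read_record(line, schema))
# {'status': None, 'campaignStatuses': None}
```

Calling `read_record(line, schema, mode="FAILFAST")` on the same line raises `MalformedRecord`, which is the behaviour that turns a silent all-null row into a visible job failure.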

@AndyRussG: what is the plan going forward? Are changes happening to the client that is sending the data, or should we assume the types are going to stay what they are now for the foreseeable future?

Ping @jkumalah so she knows that this is going on

thanks for the ping @Nuria . Will follow-up with @AndyRussG for additional insights.

Also, in order to do the archiving of this year's banner data requested here by Advancement/FR ({T161656}), we need to re-refine.

Hi! Thanks much, @Nuria, @Ottomata, @DStrine, @jkumalah :) Apologies for the delay in replying...

Just for context, data from this event is not currently used anywhere. It's for the new data pipeline, work on which is stalled. (I think we may take this up again in March.) Data from the old pipeline (calls to beacon/impression) seems fine, and contains all the same information.

> Also, in order to do the archiving of this year's banner data requested here by Advancement/FR ({T161656}), we need to re-refine.

Thanks for considering this! That task is about storing data from a different event (CentralNoticeBannerHistory) which doesn't have any refining issues, as far as I know.

> @AndyRussG: what is the plan going forward? Are changes happening to the client that is sending the data, or should we assume the types are going to stay what they are now for the foreseeable future?

Mmm yeah good question!! Looks like we need to think about backward compatibility issues, since the plan was to switch the datatype back to array eventually.

For now, if we can just ensure new data refines properly going forward, I think that's fine... I wouldn't recommend spending much effort fixing the bad data currently in Hive. (I imagine that would also be sufficient for WMDE to start looking at using the new data, as per T243092.)

Thanks again!!

> since the plan was to switch the datatype back to array eventually.

FYI, if you need a new datatype, you should just make a new field :) It isn't clear how you'd query a field that has two different types at different dates anyway. :)

> FYI, if you need a new datatype, you should just make a new field :) It isn't clear how you'd query a field that has two different types at different dates anyway. :)

Ah right, thanks! Yeah good point... I guess we weren't worrying too much about that, since this was not yet considered production data, but we definitely need to take it into account moving forward...

@Ottomata we keep 90 days of raw data, right? If so, I vote for dropping all refined data and re-refining it.

Let's:

  1. update the table in Hive to the latest schema
  2. refine (if possible) the last 90 days of data

Mentioned in SAL (#wikimedia-analytics) [2020-02-21T16:04:28Z] <ottomata> altered event.CentralNoticeImpression table column event.campaignStatuses to type string, will backfill data - T244771

Change 571365 merged by jenkins-bot:
[analytics/refinery/source@master] Refine - Warn when merging incompatible types; FAILFAST when reading JSON data with a schema

https://gerrit.wikimedia.org/r/571365

@Ottomata: can we alter the table to the latest schema and re-refine the last 90 days of data?

Selecting from table now I get:

hive (event)> select * from CentralNoticeImpression where year=2019 and month=12 and day=15 and hour=10 limit 10;
OK
ip	useragent	uuid	seqid	dt	wiki	webhost	schema	revision	topic	recvfrom	event	geocoded_data	year	month	day	hour
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveArrayInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector

Ah Nuria, sorry, I !log-ged in #wikimedia-analytics IRC but didn't post a status update directly here.

The refine backfill finished last night:

20/02/25 21:09:46 INFO Refine: Successfully refined 2251 of 2251 dataset partitions into table `event`.`CentralNoticeImpression` (total # refined records: 87333267)

I had tested ALTERing the table and refining in my own db and it worked, so I did the same for event. I see the same error you get. Will investigate.

Mentioned in SAL (#wikimedia-analytics) [2020-02-26T15:06:19Z] <ottomata> dropped and re-added backfilled partitions on event.CentralNoticeImpression table to propogate schema alter on main table - T244771

Ok fixed.

The problem was that although the table had the correctly ALTERed schema, each pre-existing partition still kept the old schema, even though the underlying data files had been re-refined with the proper schema in Parquet.

We should make Refine DROP IF EXISTS and then ADD PARTITION every time it completes a refinement. Making a task.
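As a rough illustration of that proposal, the sketch below builds the pair of HiveQL statements for one hourly partition. The helper name `readd_partition_statements` and the `location` path are hypothetical; this is not Refine's code, only a minimal picture of dropping and re-adding a partition so that it picks up the table's current schema.

```python
def readd_partition_statements(table, year, month, day, hour, location):
    """Return the HiveQL statements that drop a partition (if it exists)
    and add it back, pointing at the already re-refined data files."""
    spec = f"year={year}, month={month}, day={day}, hour={hour}"
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({spec});",
        f"ALTER TABLE {table} ADD PARTITION ({spec}) LOCATION '{location}';",
    ]


# Hypothetical path; the real location of the refined data is not given here.
statements = readd_partition_statements(
    "event.CentralNoticeImpression",
    2019, 12, 15, 10,
    "/path/to/refined/CentralNoticeImpression/year=2019/month=12/day=15/hour=10",
)
for statement in statements:
    print(statement)
```

If the table is an external table, dropping the partition removes only its metadata, so the re-refined Parquet files stay in place and the re-added partition inherits the table's current column types. That assumption should be checked for the table in question before running anything like this.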

Closing as all data is re-refined and accessible.