When we refine EventLogging schemas for insertion into Hive tables, we infer the type of fields. In the case of impressionEventSampleRate we inferred integer but in the schema it's set to "number". In the future, we will use the schema directly, but for now we're just monitoring where inferring goes wrong. All rows have a value of 0 for this property. In the code it looks like it was set to 0.01. We could alter the table and correct the data by always setting it to 0.01. Let us know if that's the right thing to do or if there's any other nuance we're missing.
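To illustrate the mismatch, here is a minimal Python sketch of value-based type inference next to a schema-based lookup. It is not the actual Refine code: the helper names, the JSON Schema to Hive mapping, and the sample event are assumptions made up for this example. It also shows one plausible reason every row reads 0: a 0.01 value stored in an integer column truncates to 0.

```python
import json

def infer_hive_type(value):
    """Guess a Hive-like type from one JSON value (hypothetical helper)."""
    # bool must be checked before int, because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

# Assumed mapping from JSON Schema types to Hive types, for illustration only.
JSON_SCHEMA_TO_HIVE = {
    "integer": "bigint",
    "number": "double",
    "boolean": "boolean",
    "string": "string",
}

def type_from_schema(schema, field):
    """Look up the declared type of a field instead of guessing from data."""
    return JSON_SCHEMA_TO_HIVE[schema["properties"][field]["type"]]

# Hypothetical sample event whose value happens to serialize as an integer.
event = json.loads('{"impressionEventSampleRate": 0}')
# Only the "number" type of this one property is taken from the task description.
schema = {"properties": {"impressionEventSampleRate": {"type": "number"}}}

inferred = infer_hive_type(event["impressionEventSampleRate"])
declared = type_from_schema(schema, "impressionEventSampleRate")
print(inferred, declared)  # bigint double

# Truncating 0.01 to an integer yields 0.
print(int(0.01))  # 0
```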
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | mforns | T214384 [Bug] Type mismatch between NavigationTiming EL schema and Hive table schema
Resolved | | Milimetric | T216771 [Bug] Type mismatch for a few other schemas
Resolved | | Milimetric | T217109 CentralNoticeImpression refined impressionEventSampleRate is int instead of double
Event Timeline
@DStrine can you let us know what you'd like to do here? It's not technically complicated, but it's a little time-sensitive in case you want to look at raw EventLogging data (which gets dropped after 90 days without a whitelist policy).
Hi! Thanks so much!!!
Here's what Hive said about the event field in the event/centralnoticeimpression table:
event struct<anonymous:boolean,banner:string,bannerCategory:string,bucket:bigint,campaign:string,campaignCategory:string,campaignCategoryUsesLegacy:boolean,country:string,db:string,debug:boolean,device:string,impressionEventSampleRate:bigint,project:string,randombanner:double,randomcampaign:double,recordImpressionSampleRate:double,result:string,status:string,statusCode:string,uselang:string,reason:string,bannerCanceledReason:string,bannersNotGuaranteedToDisplay:boolean,debugInfo:string,errorMsg:string,alterFunctionMissing:boolean,region:string>
I don't see anything else problematic, other than that impressionEventSampleRate should be double. bucket could be tinyint if you wish.
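A check like this can also be scripted. The following is a rough Python sketch, assuming a flat struct with no nested types; the function name is hypothetical, the struct string is shortened from the one above, and the only expected type listed is the one discussed in this task.

```python
def parse_flat_struct(type_string):
    """Split a flat Hive struct<...> type string into {field: type}.

    Only handles structs without nested struct/array/map types.
    """
    text = type_string.strip()
    if not (text.startswith("struct<") and text.endswith(">")):
        raise ValueError("expected a struct<...> type string")
    inner = text[len("struct<"):-1]
    fields = {}
    for part in inner.split(","):
        name, _, hive_type = part.strip().partition(":")
        fields[name] = hive_type
    return fields

# Shortened copy of the struct quoted above.
hive_type = (
    "struct<bucket:bigint,impressionEventSampleRate:bigint,"
    "randombanner:double,recordImpressionSampleRate:double>"
)

# Expected types; only this one expectation comes from the discussion here.
expected = {"impressionEventSampleRate": "double"}

actual = parse_flat_struct(hive_type)
mismatches = {
    field: (actual.get(field), wanted)
    for field, wanted in expected.items()
    if actual.get(field) != wanted
}
print(mismatches)  # {'impressionEventSampleRate': ('bigint', 'double')}
```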
The data is currently not in use, so there are no concerns about the raw data being dropped. If it's very little work to go back and change the 0s to 0.01, that might be useful, so we can compare the data from the old pipeline (which this will replace) with the new data when we get ready to switch. However, it's also fine to just have that field set correctly going forward.
Thanks again!!!!
Let's see: this data comes from EventLogging. For it to be useful, we need to make sure FR-tech has switched to EventLogging as the main way impressions are computed. Has that happened?
No, that hasn't happened yet. The events have been left on at 0.01% sample rate (hope that's OK) but the data is not being used yet. Work to finish the new pipeline should continue soon; then we'll compare the data from both sources and switch in the new pipeline once it's confirmed to be all good.
The events have been left on at 0.01% sample rate (hope that's OK)
Yes, of course. Once you are ready to switch pipelines let us know.
The easiest thing to do is to delete the old data and change the schema going forward. Let me know if this is ok to do, @AndyRussG. If not, I can do a more painful copy/rename/rename thing to keep the old data.
Sorry for the delay...!! I think so... @Seddon, is it OK with you too that the existing data in Hive obtained from the new pipeline be deleted?
Ok, done, and now I'm seeing "impressionEventSampleRate":0.01 in the data, so all is good going forward. Thanks for getting back to us and helping move this forward.
For the record, we made a teeny mistake the first time we did this, and the useragent field had a bad schema. So we redid it today, and the data and schema both look fine now.