CentralNoticeImpression refined impressionEventSampleRate is int instead of double
Closed, Resolved (Public)

Description

When we refine EventLogging schemas for insertion into Hive tables, we infer the type of each field. In the case of impressionEventSampleRate we inferred integer, but in the schema it's set to "number". In the future we will use the schema directly, but for now we're just monitoring where inference goes wrong. All rows have a value of 0 for this property; in the code it looks like it was set to 0.01. We could alter the table and correct the data by always setting it to 0.01. Let us know if that's the right thing to do, or if there's any other nuance we're missing.
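
For context, the effect is equivalent to casting the value to the inferred type. A minimal illustration, runnable in Hive or Spark SQL:

-- Casting the client's 0.01 to the inferred bigint truncates it to 0,
-- which is presumably why every row shows 0 for this property.
SELECT CAST(0.01 AS BIGINT);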

Event Timeline

Milimetric triaged this task as Medium priority. Feb 28 2019, 5:43 PM
Milimetric moved this task from Incoming to Data Quality on the Analytics board.
Milimetric removed a project: Analytics-Kanban.

@DStrine can you let us know what you'd like to do here? It's not technically complicated, but it's a little time-sensitive in case you want to look at the raw EventLogging data (which gets dropped after 90 days without a whitelist policy).

Hi! Thanks so much!!!

Here's what Hive said about the event field in the event/centralnoticeimpression table:

event   struct<anonymous:boolean,banner:string,bannerCategory:string,bucket:bigint,campaign:string,
campaignCategory:string,campaignCategoryUsesLegacy:boolean,country:string,db:string,debug:boolean,
device:string,impressionEventSampleRate:bigint,project:string,randombanner:double,randomcampaign:double,
recordImpressionSampleRate:double,result:string,status:string,statusCode:string,uselang:string,
reason:string,bannerCanceledReason:string,bannersNotGuaranteedToDisplay:boolean,debugInfo:string,
errorMsg:string,alterFunctionMissing:boolean,region:string>

I don't see anything else problematic, other than that impressionEventSampleRate should be double. bucket could be tinyint if you wish.
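
For reference, here is a sketch of an in-place type fix, assuming the table is event.centralnoticeimpression and that Hive accepts widening the struct field (CHANGE COLUMN only rewrites table metadata, so the existing data files are untouched):

-- Redeclare the event struct with impressionEventSampleRate as double.
ALTER TABLE event.centralnoticeimpression CHANGE COLUMN `event` `event`
struct<anonymous:boolean,banner:string,bannerCategory:string,bucket:bigint,campaign:string,
campaignCategory:string,campaignCategoryUsesLegacy:boolean,country:string,db:string,debug:boolean,
device:string,impressionEventSampleRate:double,project:string,randombanner:double,randomcampaign:double,
recordImpressionSampleRate:double,result:string,status:string,statusCode:string,uselang:string,
reason:string,bannerCanceledReason:string,bannersNotGuaranteedToDisplay:boolean,debugInfo:string,
errorMsg:string,alterFunctionMissing:boolean,region:string>;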

The data is currently not in use, and there are no concerns for now about the data being sunsetted. If it's very little work to go back and change the 0s to 0.01, that might be useful, so we can compare the data from the old pipeline (which this will replace) to this new data when we get ready to switch. However, it's also fine to just have that field set correctly going forward.
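
If rewriting the old rows turns out to be more trouble than it's worth, a read-time workaround is also possible: patch the known-bad zeroes at query time, since 0.01 is the only rate the client code ever used. A sketch, with hypothetical partition columns:

-- Substitute the configured 0.01 rate wherever the mistyped 0 was stored.
SELECT event.banner,
       IF(event.impressionEventSampleRate = 0, 0.01,
          event.impressionEventSampleRate) AS impression_event_sample_rate
FROM event.centralnoticeimpression
WHERE year = 2019 AND month = 2;  -- hypothetical partition filter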

Thanks again!!!!

Let's see: this data comes from EventLogging. For it to be useful, we need to make sure FR-tech has switched to EventLogging as the main way impressions are computed. Has that happened?

No, that hasn't happened yet. The events have been left on at a 0.01% sample rate (hope that's OK) but the data is not being used yet. Work to finish the new pipeline should continue soon; then we'll compare the data from both sources and switch in the new pipeline once it's confirmed to be all good.

> The events have been left on at a 0.01% sample rate (hope that's OK)

Yes, of course. Once you are ready to switch pipelines, let us know.

The easiest thing to do is to delete the old data and change the schema going forward. Let me know if this is ok to do, @AndyRussG. If not, I can do a more painful copy/rename/rename thing to keep the old data.
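
For the record, a sketch of both options with hypothetical object names, assuming the refine job recreates the table with the corrected schema on its next run:

-- Option 1, the easy path: drop the mistyped table and its data, and let
-- the refine job recreate the table going forward.
DROP TABLE event.centralnoticeimpression;

-- Option 2, keeping the old rows: move the table aside first (one reading
-- of the copy/rename/rename approach), then let refine recreate the main table.
ALTER TABLE event.centralnoticeimpression
  RENAME TO event.centralnoticeimpression_old;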

ping @AndyRussG, can you confirm that it's ok to delete the old data?

Sorry for the delay...!! I think so... @Seddon, are you also OK with deleting the existing data in Hive obtained from the new pipeline?

Seems like the right thing so that we can move forward!

@Addshore & @Verena: Be aware there may be some disruption to the CN Hive data.

Ok, done, and now I'm seeing "impressionEventSampleRate":0.01 in the data, so all is good going forward. Thanks for getting back to us and helping move this forward.
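
A quick way to confirm the fix, sketched with hypothetical partition columns:

-- After the redo, every row should report the configured 0.01 rate.
SELECT event.impressionEventSampleRate, COUNT(*) AS row_count
FROM event.centralnoticeimpression
WHERE year = 2019 AND month = 4
GROUP BY event.impressionEventSampleRate;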

Milimetric moved this task from In Progress to Done on the Analytics-Kanban board.

For the record, we made a teeny mistake the first time we did this and the useragent field had a bad schema. So we redid it today, and both the data and the schema look fine now.