
Productionize banner impressions druid/pivot dataset
Closed, Resolved · Public · 8 Estimated Story Points

Description

At the WMDS, we (Analytics) collaborated with FR-Tech to put a proof of concept of banner impression data into Druid.

The result was amazing, and it would be very interesting to productionize this dataset (implement an Oozie job and so on) so that it is updated daily, given that the data size is small and its value to the FR-Tech team is huge.

Event Timeline

If possible, it would be great for fr-tech to commit work to get this done. After an initial POC, we prefer to teach teams how to use Oozie to set up this type of job rather than having the Analytics team do it for all teams. We have collaborated in this fashion with the Discovery team, with good results.

Update: Analytics to drive productionization of this work as @mforns's availability allows.

Change 331794 had a related patch set uploaded (by Mforns):
Add banner impressions jobs

https://gerrit.wikimedia.org/r/331794

Sounds great! Thanks so much!!! :)

I successfully uploaded data for all of December. Looks good!!

I think it'd be quite easy to productionize this so the data accrues daily, or maybe even hourly.

Thanks so much for the help with this!!!!!

(BTW, this is the tool that I'd been hoping to demonstrate last Saturday in the office...)

AndyRussG set the point value for this task to 2. Jan 17 2017, 8:42 PM
ggellerman moved this task from Triage to Sprint +1 on the Fundraising-Backlog board.

@AndyRussG
We talked about this task in our team meeting and agreed that we (Analytics) should dedicate time to help you guys productionize the population of a "fundraising table" plus the daily import of the data into Druid. So feel free to set up a pair-programming session or a meeting with me, or send me any questions, code reviews, etc. :]

However, if you are interested in having real-time updates, that would be a whole other project, and we probably won't be able to work on it in the coming months. :/

@AndyRussG: daily accruals are definitely possible; hourly ones, not so much. We could update a few hours after the data is gathered, but the lag might be three or four hours depending on when jobs higher up the pipeline finish. So the update frequency would be better than daily but less than hourly.

  • Oozie work to source data from webrequest into a banner impressions table; an indexing job then loads the data into Druid (see the sketch below)
  • Add a step to the workflow that explicitly deletes the temporary files

So far it looks like this job doesn't handle any sensitive data.
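
For anyone following along, this is roughly the shape of the Hive aggregation the first bullet describes. This is a sketch only: the field and partition names come from the wmf.webrequest schema, but the filter values and the Hive host are assumptions, not the production query.

```
# Sketch of a minutely banner-impression aggregation over webrequest.
# Field/partition names follow wmf.webrequest; filters are assumptions.
from pyhive import hive

QUERY = """
SELECT
    SUBSTR(dt, 1, 16)          AS minute_ts,      -- 'YYYY-MM-DDTHH:MM'
    geocoded_data['country']   AS country,
    COUNT(*)                   AS request_count
FROM wmf.webrequest
WHERE year = 2016 AND month = 12 AND day = 1
  AND uri_path = '/beacon/impression'             -- assumed filter
GROUP BY SUBSTR(dt, 1, 16), geocoded_data['country']
"""

conn = hive.connect(host='analytics-hive.example', port=10000)  # hypothetical host
cursor = conn.cursor()
cursor.execute(QUERY)
for minute_ts, country, count in cursor.fetchall():
    print(minute_ts, country, count)
```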

Nuria changed the point value for this task from 2 to 8. Jan 19 2017, 5:26 PM

Hi all,

Just joining this work here. I've seen this tool in action and it looks pretty cool!! I love dev summit meetings and hackathons for this very reason.

@AndyRussG, fr-tech and I still have some bugs left over from our December campaign, and we have some Q3 goals we need to kick off. We could lend a hand to "productionize" this relatively soon (maybe in a couple of weeks). We just need to chat with our stakeholders. I might have more info later next week.

@DStrine: actually, we are going to go ahead and productionize it ourselves; just follow what is going on, with a view to further modifications down the line.

@Nuria ok, that's cool. Thanks for the help!

@DStrine: We will likely take this work item next week; you can follow progress on the kanban board: http://phabricator.wikimedia.org/analytics-kanban/

@AndyRussG
What are your thoughts about @JAllemandou's comments on the patch? Would minutely resolution be interesting for you? Is the region field very important, or could it be discarded in favor of real-time banner_impression data? Thanks!

I think the implementation of the jobs is finished now. I've made some changes to the patch:

  • renamed "banner impressions" to "banner activity"
  • renamed the metrics to request_count and normalized_request_count
  • minutely resolution
  • numShards = 1 for the daily job, 2 for the monthly job
  • job start time = 2016-12-01
  • improved comments

and I've tested that it works as expected. I'm now importing the whole month of December 2016 into a new dataset called banner_activity_minutely.
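
For concreteness, the list above maps onto a Druid batch ingestion spec roughly like this. This is a sketch following Druid's Hadoop indexing spec format: the granularity, shard counts, metric names, and start date are the values listed above; everything else (like the interval end) is illustrative.

```
# Rough mapping of the changes above onto a Druid Hadoop ingestion spec
# (written as Python dicts for readability; only the listed values are real).
GRANULARITY_SPEC = {
    "type": "uniform",
    "segmentGranularity": "DAY",              # daily job; "MONTH" for the monthly one
    "queryGranularity": "MINUTE",             # minutely resolution
    "intervals": ["2016-12-01/2017-01-01"],   # job start time = 2016-12-01
}

METRICS_SPEC = [
    {"type": "longSum", "name": "request_count",
     "fieldName": "request_count"},
    {"type": "doubleSum", "name": "normalized_request_count",
     "fieldName": "normalized_request_count"},
]

TUNING_CONFIG = {
    "type": "hadoop",
    "partitionsSpec": {"type": "hashed",
                       "numShards": 1},       # 1 for daily, 2 for monthly
}
```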

Hi, all! Thanks so much once again for working on this!!!! :D Here are some notes and questions:

  • Region data is important. I've made a patch to add it to the beacon/impression URL (see T156399), but for the backfill it'd be great to keep using the data available in Hive.
  • If it's possible to also have minutely resolution, that'd be fantastic.
  • Would it be possible to add to the backfill any days in November that haven't yet been purged? The year-end English fundraiser started on November 29th, so I think we're not too late to backfill data for the whole fundraiser...
  • What are your thoughts on a retention policy for the data in Druid?
  • Re: the name, how about "centralnotice_activity_minutely"? Or just "centralnotice_minutely"? Some of the statuses don't even involve loading any banners... though I suppose "banner" might be accessible to more people... (I've no objections, whichever you choose.)

Thanks again!!

Mmm it seems I may have spoken too quickly about region vs. minutely, apologies... If it turns out that it's necessary to choose, some discussion might be needed...

@AndyRussG @JAllemandou

I have added the 'country_matches_geocode' field to the banner_activity jobs, as per your suggestion. See change.
I have tested the daily job with January data, and it looks good! As @AndyRussG calculated, non-matching countries represent roughly 0.3% of the total. Druid and Pivot also react without problems to the fact that different time-spans have different schemas. So, yay!
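
(For reference, a query along these lines reproduces that figure against the loaded dataset. It's a sketch: the broker URL is hypothetical, and I'm assuming country_matches_geocode is stored as the strings "true"/"false".)

```
# Compute the share of requests whose country doesn't match the geocode,
# via a Druid timeseries query with a filtered aggregator.
import requests

def non_matching_share(broker_url, interval):
    query = {
        "queryType": "timeseries",
        "dataSource": "banner_activity_minutely",
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [
            {"type": "longSum", "name": "total",
             "fieldName": "request_count"},
            {"type": "filtered",
             "filter": {"type": "selector",
                        "dimension": "country_matches_geocode",
                        "value": "false"},          # assumed encoding
             "aggregator": {"type": "longSum", "name": "non_matching",
                            "fieldName": "request_count"}},
        ],
    }
    result = requests.post(broker_url + "/druid/v2/", json=query).json()
    counts = result[0]["result"]
    return counts["non_matching"] / counts["total"]

# e.g. non_matching_share("http://druid-broker.example:8082",
#                         "2017-01-01/2017-02-01") should be around 0.003
```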

Now, I have NOT tested the monthly job yet. Basically, I think it is too risky to test it with December 2016, because we only have one shot: tomorrow the data for the 1st of December will be erased, and if the test goes wrong, we won't have time to re-run the job and would lose the data that we have. Also, we cannot populate a new dataset, because we'd lose the last days of November, which no longer exist in the webrequest table.

I would leave the November and December data as they are (without country_matches_geocode) to avoid any potential issues, and tomorrow test the monthly job on January 2017. If it runs OK, we can merge the patch and deploy. The country_matches_geocode field would then be active from Jan 1st 2017 onward.

Let me know if that makes sense! :]

Works for me, @mforns; it seems you've indeed made the best choice (as usual ;)

BTW, we can pause the deletion jobs for a bit without violating the privacy policy, if that helps!

Change 335237 had a related patch set uploaded (by Ottomata):
Temporarily increase refined webrequest retention to 90 days

https://gerrit.wikimedia.org/r/335237

Ok, done in https://gerrit.wikimedia.org/r/#/c/335237/

I've extended the refined webrequest retention to 90 days. We need to set this back to 62 once we're done computing the banner impressions stuff.

Change 335237 merged by Ottomata:
Temporarily increase refined webrequest retention to 90 days

https://gerrit.wikimedia.org/r/335237

Change 335796 had a related patch set uploaded (by Mforns):
Add config for banner activity pivot data set

https://gerrit.wikimedia.org/r/335796

Change 335796 merged by Elukey:
Add config for banner activity pivot data set

https://gerrit.wikimedia.org/r/335796

Change 331794 merged by Joal:
Add banner activity oozie jobs

https://gerrit.wikimedia.org/r/331794

@AndyRussG @mforns @JAllemandou I thought we had this processing in near real time, but it now seems to have regressed to being batch-processed every 24 hours?

Was there a change?

Hi @Jseddon,
We did indeed set up a (non-production) near-realtime job a while ago.
A couple of weeks ago we upgraded our cluster to a new version, and that job failed.
While it's on my list of things to do, I currently have more urgent stuff on my plate.
I hope I'll be able to fix it soon.

I also want to point out that the productionized code runs at daily updates; the near-realtime job is more of a goodie :)

@Jseddon: the near-realtime job restarted successfully; enjoy your goodie ;)