Page MenuHomePhabricator

Dashboard for CentralNotice impression rates using Druid, centralnotice_analytics and CN events
Open, LowestPublic4 Estimated Story Points

Description

This task is to discuss a proposal for how make a dashboard with CentralNotice impression rates.

Results

The end results would be:

  • Automatic dashboards for Advancement showing impression rates for active Fundraising campaigns.
  • A new Druid dataset on the Analytics cluster with impression rates for all CentralNotice campaigns.

This task would use the centralnotice_analytics library to deal with complex differences in how we identify user segments in pageviews and CentralNotice datasets.

The main limitation would be that that the data would only be accurate for campaigns targeting not-logged-in users. This isn't a problem for Advancement, since we don't target logged-in users. The other main limitation is that it wouldn't initially surface campaigns that are fallen back from. (See below for details.)

Implementation

We'd create a new Druid dataset, centralnotice_impressions_and_pageviews_hourly, with the following dimensions:

DimensionDescription
campaignFinal campaign selected. This would never be empty.
bannerBanner displayed. This could be empty, since we can sometimes have a campaign selected but no banner selected.
status_codeCentralNotice status code. This would never be empty.
countryCountry code.
languageLanguage code. The join would be performed on CN's understanding of language and project language as identified in the project URL.
projectProject, in terms of CentralNotice projects (some of which encompass more than one Wikiemdia project).
deviceDevice, in terms of CentralNotice device categories (desktop, android, ipad, iphone and other).

The measures would be:

MeasureDescription
impressionsFrom banner_activity_minutely, normalized hourly total impressions for the permutation of all values of all dimensions.
pageviewsFrom pageviews_hourly, hourly total pageviews for all the permutations of columns that we can map from banner_activity_minutely (country, language, project and device). This measure will be identical for all cells with identical values in these columns, regardless of the values in the remaining columns (campaign, banner and status_code).
percentage_of_identified_bot_requestsFrom pageviews_hourly, hourly percentage identified bots for all the permutations of columns that we can map from banner_activity_minutely (same as pageviews measure, above).

Queries for a given campaign could be performed as follows:

  • Filter on a given user segment (country, language, project, device). This would usually correspond to the segment targeted by a given campaign.
  • Group by campaign.
  • For each campaign, calculate impressions/pageviews to get impression rate in the specified user segment.

(For accurate bot percentages, I think some weighting per cell might be needed? I'm not totally sure on how this measure works.)

Time granularity would be hourly.

The dataset would be generated more or less as follows:

  • For each hour, find all the values in banner_activity_minutely for campaign, banner and status code.
  • For each hour, find all the values in pageviews_hourly for country, language, project and device (as understood in CentralNotice terms).
  • Loop through all the permutations of all the above values discovered, creating a cell in the new dataset for each, and querying the source datasets for the measures they provide.

This new dataset would reside in Druid on the Analytics cluster.

Then, on the Fundraising cluster, we would create a script that would listen to the CentralNotice events Kafka streams for changes in CentralNotice campaigns. These events already exist and the streams are already available on the FR cluster.

That script would keep track of active CentralNotice fundraising campaigns, and would query the Meta API for campaign setup details.

(Note that much of the code to do this was already written for centralnotice_changes_monitor.)

The script could then configure FR Superset dashboards to display impression rates of active FR campaigns, as per the targeting settings of each campaign. Or, alternately, it could query Druid itself using the same criteria, and provide the results of those queries used for impression rate dashboards.

Initial limitations

Logged-in vs. anonymous

pageviews_hourly does not have a dimension for logged-in status, though it appears that data point is available in Hive from X-Analytics. If the Druid dataset could be updated to include that as a dimension, it would be possible to get data for logged-in users, too.

Until then, impression rates will encompass both logged-in and anonymous users. This isn't a significant problem for Fundraising campaign impression rates, since only a small percentage of users ever log in.

Language information is also not accurate for logged-in users, but this could also be fixed by adding the data in Hive and pageviews_hourly (same solution as below, for language in multilingual projects).

Language for multilingual projects

Initially, we might simply omit data for multilingual projects (like commons and meta). This isn't a problem for Fundraising campaigns, since they don't target those projects. This limitation could be overcome by adding user language to Hive pageview data and then to pageviews_hourly in Druid.

Campaign fallback

We have not yet adapted banner_activity_minutely to take into account campaign fallback, so we'd only surface impression rates for the last campaign selected in the fallback loop. Taking into account campaigns fallen back from is not straightforward, but likely possible. See T236734.

Regional geotargeting

Once regional geotargeting for CentralNotice is released, it could be added to banner_activity_minutely. After that happens, the proposed impression rate dataset could also include it.

Event Timeline

Milimetric moved this task from Incoming to Smart Tools for Better Data on the Analytics board.
Milimetric subscribed.

We're ready to help here, just help us coordinate.

Also questions for @AndyRussG: do you want to turn up sampling to 100%? Right now it's 1% and is around 600 events per minute (meaning we're ok with 100% sampling now, from the point of view of scale).

DStrine lowered the priority of this task from High to Lowest.Jun 11 2020, 4:30 PM

This is a good conversation but looking at the fr-tech roadmap in the next few months, this is out of scope. We're pushing hard to get some things out in the next 4 weeks so we might start conversations again in July or August but no guarantees.

We're ready to help here, just help us coordinate.

Thanks so much! Hugely appreciated!! :)

Also questions for @AndyRussG: do you want to turn up sampling to 100%? Right now it's 1% and is around 600 events per minute (meaning we're ok with 100% sampling now, from the point of view of scale).

You mean client-side sampling for the events that ingress data into banner_activity_minutely?

The client-side sampling rate for those events varies, depending on the settings for the CentralNotice campaign that was selected on the client. Most non-Fundraising campaigns sample at 1%. Fundraising campaigns are at 100%.

That's why the Durid measure we'd want to look at is normalized_request_count, which will provide the estimated total number of events for sampled campaigns based on sampling rate, or the actual total number of events for campaigns at 100% sample rate. Here is how it's computed by the oozie job.

Thanks so so much once again! :D

Looked at this a little bit closer. I don't see any reason it can't be implemented whenever you prioritize it. One small detail is that you can just join banner_activity_minutely to pageview_hourly instead of looping over all permutations. Hive can do cross joins, we can hash out the details when you get there.