
NEW FEATURE REQUEST: <Update webrequest derived tables to use the new column for referer data>
Closed, Resolved | Public | 5 Estimated Story Points | Feature


Data Platform Request Form

Is this a request for a:

  • Dataset
  • Data Pipeline
  • Data Feature

Please provide the description of your request:
For business and data analysts to use the new field referer_data, added as part of T327074, we have to add it to the pageview_hourly, pageview_actor and pageview_daily tables.

Note: referer_class has been updated with the new type external (media sites). We will need a new column called referer_name, which will provide the name of the search engine (Google, Bing, etc.) or media site (TikTok, Instagram, etc.) for the external (media sites) and external (search engine) referer classes.

Use Case: (Please explain what this feature will be used for):

Product Analytics doesn't regularly query the webrequest table because it is huge and granular. Instead, the team relies on the derived and aggregated tables for analyses. We are therefore requesting that DE add a new referer_name column to the hourly and daily tables in Hive and Druid, so that Product Analytics and other stakeholders (Partnerships, Audience Insights, etc.) can easily query and chart the data in Superset and Turnilo and use it to make informed decisions.


Ideal Delivery Date: end of Q3

Dataset Checklist
  • Provide link to CSV/GSheet example data. Link: ____
  • Provide link to the desired Table Schema. Link: ____
  • Does this data contain anything that is sensitive, PII or Private?
    • Yes
    • No
    • I don't know
  • Who will own the data (Fix issues, update descriptions & metadata etc.)?
Datapipeline Checklist
  • Do you have the transformation you would like to be applied? Link: ____
  • Does this data need to be linked to other data in the Data Lake?
    • Yes
    • No
Data Feature Checklist

Please link to the following if applicable.

Document Type | Required? | Document/Link
Related PHAB Tickets | Yes | <add link here>
Product One Pager | Yes | <add link here>
Product Requirements Document (PRD) | Yes | <add link here>
Product Roadmap | No | <add link here>
Product Planning/Business Case | No | <add link here>
Product Brief | No | <add link here>
Other Links | No | <add links here>
For Data Engineering Team to fill out:
Value Calculator | Rank
Will this improve the efficiency of a team's workflow? | 1-3
Does this have an effect on our Core Metrics? | 1-3
Does this align with our strategic goals? | 1-3
Is this a blocker for another team? | 1-3

Event Timeline

Mar 2 2023, 7:01 PM: Task renamed from NEW FEATURE REQUEST: <Update webrequest derived tables to use a new column for referer data> to NEW FEATURE REQUEST: <Update webrequest derived tables to use the new column for referer data>. Task description updated. Subscribers added: JAllemandou.

Key decisions from today's meeting with @JAllemandou and @lbowmaker

  • changing the request to add one new column referer_name, since referer_class has already been updated with external (media sites) (updated the task description)
  • adding referer_name has PII implications. We will need to reach out to the Legal, Safety and Security Service Center to get sign-off before adding this field
  • the additional field will not cause cardinality issues in Druid, as the number of distinct values in referer_name is < 100
  • Note for the Data Engineer: to keep Hive and Spark aligned, it is advisable to make these updates in SparkSQL (which will automatically update the Hive metastore).
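As a rough illustration of the SparkSQL note above, the schema change could be expressed as one ALTER TABLE statement per derived table. This is only a sketch: the unqualified table names come from the task description, while the STRING type, the column comment and the helper name add_column_statements are assumptions made for the example, not the team's actual migration.

```python
from typing import List

# Tables named in the task description (database prefix omitted on purpose).
DERIVED_TABLES = ["pageview_hourly", "pageview_actor", "pageview_daily"]


def add_column_statements(
    tables: List[str],
    column: str = "referer_name",
    col_type: str = "STRING",  # assumed type, not confirmed in the task
    comment: str = "Name of the referring search engine or media site",  # assumed wording
) -> List[str]:
    """Build one Spark SQL ALTER TABLE statement per table (hypothetical helper)."""
    return [
        f"ALTER TABLE {table} ADD COLUMNS ({column} {col_type} COMMENT '{comment}')"
        for table in tables
    ]


if __name__ == "__main__":
    for statement in add_column_statements(DERIVED_TABLES):
        print(statement)
```

Each generated statement would then be passed to an existing SparkSession, for example with spark.sql(statement), so that the change goes through SparkSQL rather than Hive directly.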

Update: Privacy request submitted on 3/7. Responded for additional information on 3/14.

We have gotten the okay from the Privacy team. See Privacy review - Pageview referrer column 20230314
"Overall, the proposed change already embeds adequate privacy safeguard. Therefore, it poses a low level of privacy risk, which can be automatically accepted by Product Analytics, as per the Security team’s risk management policy."

And we have approval from Legal as well! L3SC Privacy Request: Product Analytics - Track referer name

Privacy legal conclusion: low risk

Public-facing: While our internal tables collect pageviews based on views per hour and views per day (and some of these external sites do have less than 1000 views per hour, for instance: Coursera), we only publish metrics for monthly pageviews. When considering the monthly pageviews, even the views for the smaller sites on the list consistently add up to more than 1000 views. This attends to our recommendations on only publishing data referring to subsets of >1000 people in order to keep it low-risk. Referrer names for individual articles, user pages and commons page will not be published.
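The publication rule described above (aggregate to monthly totals, publish only values above 1000) can be sketched as follows. The function name, the row shape of (referer_name, views) pairs and the sample numbers in the usage note are hypothetical; only the threshold of 1000 comes from the privacy conclusion.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Threshold taken from the privacy conclusion: publish only totals above 1000.
PUBLISH_THRESHOLD = 1000


def publishable_monthly_views(
    rows: Iterable[Tuple[str, int]],
) -> Dict[str, int]:
    """Sum views per referer_name over a month and keep totals above the threshold.

    `rows` is an assumed shape: (referer_name, view_count) pairs for one month.
    """
    totals: Counter = Counter()
    for name, views in rows:
        totals[name] += views
    return {name: total for name, total in totals.items() if total > PUBLISH_THRESHOLD}
```

For example, two hourly rows of 600 and 500 views for one site would sum to 1100 and be kept, while a site totalling exactly 1000 would be withheld, because the rule is strictly greater than 1000.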

Internal-facing: Although the internal metrics may store data on less visited sites, the fact that we will only be collecting the referrer name for 14 media sites (Youtube, Twitter, Facebook, Reddit, TikTok, Quora, Instagram, ycombinator, LinkedIn, Github, Pinterest, Coursera, Medium, Stackoverflow) and 23 search engines (see list here) attends to data minimization concerns and alleviates the re-identification risk. Users from other media sites/search engines outside of this list will not have a value for referer_name. The re-identification risk was also deemed low by Privacy Engineering.
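A minimal sketch of the allowlist behaviour described above, for the 14 media sites only: a referer whose host matches a listed site gets that site's name, and anything else gets no value. The domain for each site and the suffix-matching rule are assumptions made for illustration; the actual classification logic and the list of 23 search engines are not given in this task, so search engines are left out.

```python
from typing import Optional
from urllib.parse import urlsplit

# Site names as listed in the privacy conclusion; the domains are assumed.
MEDIA_SITE_DOMAINS = {
    "youtube.com": "Youtube",
    "twitter.com": "Twitter",
    "facebook.com": "Facebook",
    "reddit.com": "Reddit",
    "tiktok.com": "TikTok",
    "quora.com": "Quora",
    "instagram.com": "Instagram",
    "ycombinator.com": "ycombinator",
    "linkedin.com": "LinkedIn",
    "github.com": "Github",
    "pinterest.com": "Pinterest",
    "coursera.org": "Coursera",
    "medium.com": "Medium",
    "stackoverflow.com": "Stackoverflow",
}


def referer_name(referer: str) -> Optional[str]:
    """Return the allowlisted site name for a referer URL, or None if not listed."""
    host = (urlsplit(referer).hostname or "").lower()
    for domain, name in MEDIA_SITE_DOMAINS.items():
        # Match the bare domain or any subdomain of it (assumed matching rule).
        if host == domain or host.endswith("." + domain):
            return name
    return None
```

Returning None for hosts outside the list mirrors the statement that such referers will not have a value for referer_name, which is what keeps the column's cardinality small.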

@lbowmaker / @JArguello-WMF, now that we have approval, can we pls prioritize this task for the next DE sprint?
cc @JAllemandou