Page MenuHomePhabricator

NEW BUG REPORT : Referer Name field is Null on Druid pageviews_daily
Closed, ResolvedPublicBUG REPORT

Description

Data Platform Engineering Bug Report or Data Problem Form.

Please fill out the following
Please ensure you set priority

What kind of problem are you reporting?

  • Access related problem
  • Service related problem
  • Data related problem
For a data related problem:
  • Is this a data quality issue?

Yes

  • What datasets and/or dashboards are affected?

Druid - pageviews_daily

  • What are the observed vs expected results? Please include information such as location of data, any initial assessments, sql statements, screenshots

pageviews_hourly has referer_name field which is not populated in pageviews_daily. We are interested in looking at referal data in detail and would like to visualize it on Turnilo and Superset.

Current result:
All referer names are null - https://w.wiki/FZv3

image.png (675×999 px, 69 KB)

Expected result:
pageviews_daily should have valid values for referer name, similar to pageviews_hourly - https://w.wiki/FZv4
image.png (732×1 px, 94 KB)

image.png (411×588 px, 67 KB)

Details

Related Changes in Gerrit:
Related Changes in GitLab:
TitleReferenceAuthorSource BranchDest Branch
Add referer_name to pageviews_daily druid loadingrepos/data-engineering/airflow-dags!1734joalupdate_pageviews_daily_druid_loadmain
Customize query in GitLab

Event Timeline

hmm, apparently this was done as a part of T334104#8759879, but I wonder if we encountered a bug later and the data stopped loading from hive to druid?

The referer_name field is loaded in the pageviews_hourly datasource containing 90 days of data.
We don't load it in the pageviews_daily, to reduce data size as this datasource contains multiple years of data. We have the referer_class but not the name.
Is this enough for you to have the latest 90 days in turnilo knowing that longer term data is available in the wmf_readership.referer_daily table (this table is optimized for superset, response times should be fast :).

@JAllemandou - We would prefer to have the full history of referer_name in pageviews_daily data sources. We have 'Turnilo-only' users who are interested in viewing the full history of individual referrers and Turnilo offers a much better experience compared to Superset. Will loading this additional field cause any additional challenges beyond the extra storage cost?

@OSefu-WMF - I think druid can cope with the field addition. The datasource currently weights 250Gb, and I assume that adding the new field will not make it grow more than 50Gb (20/30Gb is probably closer to reality). The druid cluster can handle that. Adding the field will require a full reload of the datasource, which will take time.
Also, I'd be interested to have a broader discussion on turnilo vs superset, which is actually more about Druid vs Presto. I'd like for us to remove druid from our analytics use-cases, replacing it by dedicated datasets to be used with presto, providing a similar experience (probably a bit slower) in Superset. The concern is that this implies removing Turnilo as it's only available for Druid, which does't seem to be an option.
I'd be happy to have your views on this.

Change #1194821 had a related patch set uploaded (by Joal; author: Joal):

[analytics/refinery@master] Add referer_name to druid pageview_daily

https://gerrit.wikimedia.org/r/1194821

Change #1194821 merged by Joal:

[analytics/refinery@master] Add referer_name to druid pageview_daily

https://gerrit.wikimedia.org/r/1194821

Mentioned in SAL (#wikimedia-analytics) [2025-10-09T07:52:02Z] <joal> Deploying refinery with scap to unlock T406530

The referer_name field got added on April, 11th 2023 (T334120). I'll backfill data starting from that date.

Data has been backfilled from 2025-04-01 to today using Airflow. I need a manual backfill of earlier data, I'll do this tomorrow or Monday.

All the data has been backfilled.