Create new table for 'referer' aggregated data
Open, Medium, Public, 13 Estimated Story Points

Assigned To

None

Authored By

	JAllemandou
	Sep 11 2015, 4:26 PM

Description

The goal is to keep referrer information for the long term.

Related Objects

Status	Assigned	Task
Open	None	T112284 Create new table for 'referer' aggregated data
Duplicate	None	T112911 Define a first set of metrics to be worked for wikistats 2.0 {lama} [8 pts]
Resolved	Milimetric	T114669 Spike: Understand how Wikistats Traffic reports are computed {lama} [8 pts]

Event Timeline

JAllemandou created this task.Sep 11 2015, 4:26 PM

JAllemandou raised the priority of this task from to Needs Triage.

JAllemandou updated the task description. (Show Details)

JAllemandou added a project: Analytics-Backlog.

JAllemandou added subscribers: JAllemandou, Ironholds.

Restricted Application added a subscriber: Aklapper.Sep 11 2015, 4:26 PM

JAllemandou mentioned this in T108886: Analyze referrer traffic to determine report format {hawk} [8 pts].Sep 11 2015, 4:27 PM

• madhuvishy triaged this task as Medium priority.Sep 17 2015, 5:24 PM

• madhuvishy added a subtask: T112911: Define a first set of metrics to be worked for wikistats 2.0 {lama} [8 pts].

• madhuvishy set Security to None.

• madhuvishy moved this task from Incoming to Prioritized on the Analytics-Backlog board.

Milimetric updated the task description. (Show Details)Oct 5 2015, 4:44 PM

Milimetric moved this task from Prioritized to Blocked on the Analytics-Backlog board.

Milimetric moved this task from Blocked to Prioritized on the Analytics-Backlog board.Nov 12 2015, 6:09 PM

• ggellerman edited projects, added Analytics; removed Analytics-Backlog.Jan 12 2016, 7:30 PM

Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.Jan 12 2016, 7:30 PM

• Nuria moved this task from Analytics Query Service to Dashiki on the Analytics board.May 2 2016, 4:45 PM

Milimetric moved this task from Dashiki to Backlog (Later) on the Analytics board.Jul 7 2016, 5:57 PM

• Tbayer subscribed.Sep 8 2016, 8:10 AM

• Nuria moved this task from Backlog (Later) to Wikistats on the Analytics board.Oct 5 2016, 4:54 PM

• Nuria moved this task from Wikistats to Dashiki on the Analytics board.Apr 24 2017, 4:23 PM

• fdans moved this task from Dashiki to Backlog (Later) on the Analytics board.Jul 27 2017, 4:09 PM

• fdans moved this task from Backlog (Later) to Operational Excellence Future on the Analytics board.Oct 9 2017, 4:24 PM

The referer table will contain:

Referer (normalised hostname)
Request counts for that referer
Country
Wikimedia project (hostname)
Agent type (user/bot)
Origin (internal/external)
Tag (pageview, api call...)

This needs an Oozie job that runs every hour, refines the data and generates the table.
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream_top_referrers
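The proposed table amounts to a count of requests grouped by the six dimensions listed above. A minimal sketch of that aggregation follows, assuming refined request records are already available in memory. The `Request` type, its field names and the sample values are hypothetical and only illustrate the shape of the data; the real job would run as a query over the refined request data rather than in Python.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Request(NamedTuple):
    """One refined request record (field names are illustrative assumptions)."""
    referer_host: str  # normalised referer hostname
    country: str
    project: str       # Wikimedia project hostname
    agent_type: str    # "user" or "bot"
    origin: str        # "internal" or "external"
    tag: str           # e.g. "pageview" or "api"


def aggregate(requests: Iterable[Request]) -> Counter:
    """Count requests per unique combination of the six dimensions."""
    return Counter(requests)


# Hypothetical sample of one hour of refined requests.
hourly = [
    Request("google.com", "FR", "fr.wikipedia.org", "user", "external", "pageview"),
    Request("google.com", "FR", "fr.wikipedia.org", "user", "external", "pageview"),
    Request("en.wikipedia.org", "US", "en.wikipedia.org", "user", "internal", "pageview"),
]

counts = aggregate(hourly)
for key, request_count in counts.items():
    print(key.referer_host, key.country, key.project, request_count)
```

Each key in `counts` corresponds to one row of the proposed table, with the counter value as the request count.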

• fdans set the point value for this task to 13.Nov 6 2017, 5:35 PM

• Nuria updated the task description. (Show Details)Nov 6 2017, 7:37 PM

• Nuria added a subscriber: • DarTar.

• Nuria removed a subscriber: Ironholds.

• Nuria moved this task from Operational Excellence Future to Dashiki on the Analytics board.Jan 11 2018, 5:35 PM

Milimetric moved this task from Dashiki to Incoming on the Analytics board.Apr 2 2018, 3:32 PM

• Nuria lowered the priority of this task from Medium to Low.Apr 5 2018, 5:13 PM

• Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.

mforns raised the priority of this task from Low to Medium.Aug 10 2020, 4:19 PM

mforns moved this task from Backlog (Later) to Datasets on the Analytics board.

We were asked by Partnerships to re-evaluate the priority of this task. cc @odimitrijevic

I just realized I personally finished and merged some of this work with Isaac and Baha earlier this year:

presto:wmf> desc referrer_daily;

Column	Type	Extra	Comment
country	varchar		Reader country per IP geolocation
lang	varchar		Wikipedia language -- e.g., en for English
browser_family	varchar		Browser family from user-agent
os_family	varchar		OS family from user-agent
search_engine	varchar		One of ~20 standard search engines (e.g., Google)
num_referrals	integer		Number of pageviews from the referral source
year	integer	partition key	Unpadded year of request
month	integer	partition key	Unpadded month of request
day	integer	partition key	Unpadded day of request

This table is being updated daily and has recent data. The task that implemented it was T270140. But that's also not marked done or in Kanban. Worth discussing.
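Because the partition keys are unpadded integers, a query for a single day has to filter on `year`, `month` and `day` as numbers rather than on a zero-padded date string. The sketch below builds such a query as text. The helper name and the example date are hypothetical, and the `wmf` schema is assumed from the `presto:wmf>` prompt shown above.

```python
from datetime import date


def referrals_by_engine_query(day: date) -> str:
    """Build a Presto query summing num_referrals per search engine for one day."""
    # date.year/.month/.day are plain ints, so they format without zero padding,
    # which matches the unpadded partition values.
    return (
        "SELECT search_engine, SUM(num_referrals) AS referrals\n"
        "FROM wmf.referrer_daily\n"
        f"WHERE year = {day.year} AND month = {day.month} AND day = {day.day}\n"
        "GROUP BY search_engine\n"
        "ORDER BY referrals DESC"
    )


# Hypothetical example date.
print(referrals_by_engine_query(date(2021, 10, 5)))
```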

@Milimetric What is the list of the search engines that we keep the data for?

odimitrijevic triaged this task as Medium priority.Oct 25 2021, 3:54 PM

odimitrijevic moved this task from Incoming to Datasets on the Analytics board.

In T112284#7455411, @odimitrijevic wrote:

@Milimetric What is the list of the search engines that we keep the data for?

The current list of search engines:

https://gerrit.wikimedia.org/g/analytics/refinery/source/+/14f13cb14a94b7f0be2a53bfc2c0bd99c5248896/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/SearchEngine.java#25

GOOGLE("Google", "google\\.", ""),
GOOGLE_TRANSLATE("Google Translate", "translate\\.googleusercontent\\.", "prev=search|client=srp"),
YAHOO("Yahoo", "search\\.yahoo\\.", ""),
BING("Bing", "\\.bing\\.", ""),
YANDEX("Yandex", "yandex\\.", ""),
BAIDU("Baidu", "\\.baidu\\.", ""),
DDG("DuckDuckGo", "duckduckgo\\.", ""),
ECOSIA("Ecosia", "\\.ecosia\\.", ""),
STARTPAGE("Startpage", "\\.(startpage|ixquick)\\.", ""),
NAVER("Naver", "search\\.naver\\.", ""),
DOCOMO("Docomo", "\\.docomo\\.", ""),
QWANT("Qwant", "qwant\\.", ""),
DAUM("Daum", "search\\.daum\\.", ""),
MYWAY("MyWay", "search\\.myway\\.", ""),
SEZNAM("Seznam", "\\.seznam\\.", ""),
AU("AU", "search\\.auone\\.", ""),
ASK("Ask", "\\.ask\\.", ""),
LILO("Lilo", "\\.lilo\\.", ""),
COC_COC("Coc Coc", "coccoc\\.", ""),
AOL("AOL", "search\\.aol\\.", ""),
RAKUTEN("Rakuten", "\\.rakuten\\.", ""),
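To illustrate how patterns like these can map a referrer hostname to a search engine label, here is a rough sketch using a subset of the entries quoted above. It assumes the second field is a regular expression searched for anywhere in the hostname and that the first matching entry wins. It ignores the third field, and it is not the actual `SearchEngine.java` logic, which may differ. The `classify` function is hypothetical.

```python
import re
from typing import Optional

# Subset of the (label, hostname pattern) pairs quoted above, in the same order.
# Java string escapes such as "google\\." become the regex google\. here.
SEARCH_ENGINES = [
    ("Google", r"google\."),
    ("Yahoo", r"search\.yahoo\."),
    ("Bing", r"\.bing\."),
    ("DuckDuckGo", r"duckduckgo\."),
    ("Startpage", r"\.(startpage|ixquick)\."),
]


def classify(referrer_host: str) -> Optional[str]:
    """Return the label of the first pattern found in the hostname, else None."""
    for label, pattern in SEARCH_ENGINES:
        if re.search(pattern, referrer_host):
            return label
    return None


for host in ["www.google.com", "search.yahoo.com", "www.ixquick.com", "example.org"]:
    print(host, classify(host))
```

Patterns with a leading `\.` only match when the engine name is preceded by a subdomain, so under these assumptions `www.bing.com` would be classified as Bing while a bare `bing.com` would not.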

Side note: the excluded countries are hardcoded here: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/655804/14/oozie/referrer/daily/referrer.hql#36. Perhaps we should revisit this and use the centralized exclude list instead: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors/Public#Country_Protection_List

odimitrijevic added a project: Data-Engineering.Jan 6 2022, 3:22 AM

odimitrijevic moved this task from Incoming (new tickets) to Datasets on the Data-Engineering board.Jan 6 2022, 3:43 AM

odimitrijevic removed a project: Analytics.Jan 12 2022, 12:20 AM

JArguello-WMF moved this task from Datasets to Data Products & Metrics on the Data-Engineering board.Jun 29 2023, 11:45 PM

lbowmaker moved this task from Data Products & Metrics to Icebox (not considered in current quarter) on the Data-Engineering board.Nov 10 2023, 2:26 PM
