Create new table for 'referer' aggregated data
Open, Medium | Public | 13 Estimated Story Points

Description

The goal is to keep referrer information for the long term.

Event Timeline

JAllemandou raised the priority of this task from to Needs Triage.
JAllemandou updated the task description. (Show Details)
JAllemandou added a project: Analytics-Backlog.
JAllemandou added subscribers: JAllemandou, Ironholds.

The referer table will contain:

  • Referer (normalised hostname)
  • Request counts for that referer
  • Country
  • Wikimedia project (hostname)
  • Agent type (user/bot)
  • Origin (internal/external)
  • Tag (pageview, api call...)

Needs an Oozie job that runs every hour, refines the data, and generates the table.
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream_top_referrers
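The aggregation step of such a job could look roughly like the sketch below. It is only an illustration of grouping requests by the proposed dimensions and counting them. The class, the `Request` fields, the "none" placeholder for a missing referer, and the internal/external rule are all assumptions made for this example, not the actual webrequest schema or refinery code.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RefererAggregationSketch {

    /** One raw request. Field names are hypothetical, not the real webrequest columns. */
    static final class Request {
        final String refererUrl, country, project, agentType, tag;

        Request(String refererUrl, String country, String project, String agentType, String tag) {
            this.refererUrl = refererUrl;
            this.country = country;
            this.project = project;
            this.agentType = agentType;
            this.tag = tag;
        }
    }

    /** Lower-cases the referer host and strips a leading "www."; "none" if missing or unparsable. */
    static String normalizeHost(String refererUrl) {
        if (refererUrl == null || refererUrl.isEmpty()) {
            return "none";
        }
        try {
            String host = URI.create(refererUrl).getHost();
            if (host == null) {
                return "none";
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.startsWith("www.") ? host.substring(4) : host;
        } catch (IllegalArgumentException e) {
            return "none";
        }
    }

    /** Assumption: "internal" means the referer host is on one of these two domains. */
    static String origin(String host) {
        for (String domain : Arrays.asList("wikipedia.org", "wikimedia.org")) {
            if (host.equals(domain) || host.endsWith("." + domain)) {
                return "internal";
            }
        }
        // In this sketch a missing referer ("none") also ends up here.
        return "external";
    }

    /** Counts requests per (referer, country, project, agent type, origin, tag). */
    static Map<List<String>, Long> aggregate(List<Request> requests) {
        Map<List<String>, Long> counts = new LinkedHashMap<>();
        for (Request r : requests) {
            String host = normalizeHost(r.refererUrl);
            List<String> key =
                Arrays.asList(host, r.country, r.project, r.agentType, origin(host), r.tag);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Request> hour = Arrays.asList(
            new Request("https://www.google.com/search?q=wiki", "US", "en.wikipedia.org", "user", "pageview"),
            new Request("https://google.com/", "US", "en.wikipedia.org", "user", "pageview"),
            new Request("https://en.wikipedia.org/wiki/Main_Page", "FR", "en.wikipedia.org", "user", "pageview"));
        aggregate(hour).forEach((key, count) -> System.out.println(key + " -> " + count));
    }
}
```

In a real job the grouping would more likely be a `GROUP BY` over the refined hourly partition; the in-memory map here only makes the row shape concrete.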

fdans set the point value for this task to 13. Nov 6 2017, 5:35 PM
Nuria added a subscriber: DarTar.
Nuria removed a subscriber: Ironholds.
Milimetric moved this task from Dashiki to Incoming on the Analytics board.
Nuria lowered the priority of this task from Medium to Low. Apr 5 2018, 5:13 PM
Nuria moved this task from Incoming to Backlog (Later) on the Analytics board.
mforns raised the priority of this task from Low to Medium. Aug 10 2020, 4:19 PM
mforns moved this task from Backlog (Later) to Datasets on the Analytics board.
Milimetric raised the priority of this task from Medium to Needs Triage. Oct 22 2021, 2:46 PM
Milimetric moved this task from Datasets to Incoming on the Analytics board.
Milimetric added subscribers: odimitrijevic, Milimetric.

We were asked by Partnerships to re-evaluate the priority of this task. cc @odimitrijevic

I just realized I personally finished and merged some of this work with Isaac and Baha earlier this year:

presto:wmf> desc referrer_daily;
Column          Type     Extra          Comment
country         varchar                 Reader country per IP geolocation
lang            varchar                 Wikipedia language -- e.g., en for English
browser_family  varchar                 Browser family from user-agent
os_family       varchar                 OS family from user-agent
search_engine   varchar                 One of ~20 standard search engines (e.g., Google)
num_referrals   integer                 Number of pageviews from the referral source
year            integer  partition key  Unpadded year of request
month           integer  partition key  Unpadded month of request
day             integer  partition key  Unpadded day of request

This table is being updated daily and has recent data. The task that implemented it was T270140. But that's also not marked done or in Kanban. Worth discussing.

@Milimetric What is the list of the search engines that we keep the data for?

odimitrijevic moved this task from Incoming to Datasets on the Analytics board.

The current list of search engines:

https://gerrit.wikimedia.org/g/analytics/refinery/source/+/14f13cb14a94b7f0be2a53bfc2c0bd99c5248896/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/SearchEngine.java#25

GOOGLE("Google", "google\\.", ""),
GOOGLE_TRANSLATE("Google Translate", "translate\\.googleusercontent\\.", "prev=search|client=srp"),
YAHOO("Yahoo", "search\\.yahoo\\.", ""),
BING("Bing", "\\.bing\\.", ""),
YANDEX("Yandex", "yandex\\.", ""),
BAIDU("Baidu", "\\.baidu\\.", ""),
DDG("DuckDuckGo", "duckduckgo\\.", ""),
ECOSIA("Ecosia", "\\.ecosia\\.", ""),
STARTPAGE("Startpage", "\\.(startpage|ixquick)\\.", ""),
NAVER("Naver", "search\\.naver\\.", ""),
DOCOMO("Docomo", "\\.docomo\\.", ""),
QWANT("Qwant", "qwant\\.", ""),
DAUM("Daum", "search\\.daum\\.", ""),
MYWAY("MyWay", "search\\.myway\\.", ""),
SEZNAM("Seznam", "\\.seznam\\.", ""),
AU("AU", "search\\.auone\\.", ""),
ASK("Ask", "\\.ask\\.", ""),
LILO("Lilo", "\\.lilo\\.", ""),
COC_COC("Coc Coc", "coccoc\\.", ""),
AOL("AOL", "search\\.aol\\.", ""),
RAKUTEN("Rakuten", "\\.rakuten\\.", ""),
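Each entry quoted here has three fields: a display name, a regex for the referer host, and an optional regex for the query string. As a rough illustration of how entries of this shape could be applied to a referer URL, here is a minimal sketch using a handful of them. The `SearchEngineSketch` class, its `classify` method, and the "none" fallback are hypothetical. The sketch does not reproduce how refinery itself anchors or combines these patterns.

```java
import java.net.URI;
import java.util.regex.Pattern;

class SearchEngineSketch {

    /** A small subset of the quoted entries, with the same three arguments. */
    enum Engine {
        GOOGLE("Google", "google\\.", ""),
        GOOGLE_TRANSLATE("Google Translate", "translate\\.googleusercontent\\.", "prev=search|client=srp"),
        BING("Bing", "\\.bing\\.", ""),
        DDG("DuckDuckGo", "duckduckgo\\.", "");

        final String displayName;
        final Pattern host;
        final Pattern params; // null when the entry has no query-string constraint

        Engine(String displayName, String hostRegex, String paramsRegex) {
            this.displayName = displayName;
            this.host = Pattern.compile(hostRegex);
            this.params = paramsRegex.isEmpty() ? null : Pattern.compile(paramsRegex);
        }
    }

    /** Returns the display name of the first matching entry, or "none" (a label made up for this sketch). */
    static String classify(String refererUrl) {
        if (refererUrl == null) {
            return "none";
        }
        URI uri;
        try {
            uri = URI.create(refererUrl);
        } catch (IllegalArgumentException e) {
            return "none";
        }
        String host = uri.getHost();
        if (host == null) {
            return "none";
        }
        String query = uri.getRawQuery() == null ? "" : uri.getRawQuery();
        for (Engine e : Engine.values()) {
            // Assumption: an unanchored search on the host, plus a search on the query if one is required.
            boolean hostMatches = e.host.matcher(host).find();
            boolean paramsMatch = e.params == null || e.params.matcher(query).find();
            if (hostMatches && paramsMatch) {
                return e.displayName;
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println(classify("https://www.google.com/search?q=wiki"));
        System.out.println(classify("https://translate.googleusercontent.com/translate_c?prev=search&u=x"));
        System.out.println(classify("https://example.org/page"));
    }
}
```

Note that with plain unanchored matching, a pattern such as `\\.bing\\.` only matches hosts with a subdomain (for example `www.bing.com`, not `bing.com`), so the real matching logic presumably does more than this.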