The goal is to keep referrer information for the long term.
Description
Status | Subtype | Assigned | Task |
---|---|---|---|
Open | | None | T112284 Create new table for 'referer' aggregated data |
Duplicate | | None | T112911 Define a first set of metrics to be worked for wikistats 2.0 {lama} [8 pts] |
Resolved | | Milimetric | T114669 Spike: Understand how Wikistats Traffic reports are computed {lama} [8 pts] |
Event Timeline
The referer table will contain:
- Referer (normalised hostname)
- Request counts for that referer
- Country
- Wikimedia project (hostname)
- Agent type (user/bot)
- Origin (internal/external)
- Tag (pageview, api call...)
This needs an Oozie job that runs every hour, refines the data, and generates the table.
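To make the proposed grain concrete, here is a rough Python sketch of the hourly aggregation: one request count per (referer host, country, project, agent type, origin, tag). It is not the Oozie/Hive job itself. The input record keys, the `INTERNAL_SUFFIXES` list, and the rule "lowercased hostname" for normalisation are all assumptions made for illustration; the task does not specify them.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical suffixes treated as "internal"; the real rule is not defined in this task.
INTERNAL_SUFFIXES = ("wikipedia.org", "wikimedia.org")


def normalise_host(referer_url):
    """Return the lowercased hostname of a referer URL, or "none" if there is none."""
    if not referer_url:
        return "none"
    try:
        host = urlsplit(referer_url).hostname
    except ValueError:
        # Malformed URLs (e.g. broken IPv6 brackets) are treated as having no referer.
        return "none"
    return host if host else "none"


def origin_of(host):
    """Classify a normalised host as internal, external, or none (assumed rule)."""
    if host == "none":
        return "none"
    for suffix in INTERNAL_SUFFIXES:
        if host == suffix or host.endswith("." + suffix):
            return "internal"
    return "external"


def aggregate(records):
    """Count requests per (referer, country, project, agent_type, origin, tag)."""
    counts = Counter()
    for rec in records:
        host = normalise_host(rec.get("referer"))
        key = (
            host,
            rec["country"],
            rec["project"],
            rec["agent_type"],
            origin_of(host),
            rec["tag"],
        )
        counts[key] += 1
    return counts
```

Each key of the returned counter corresponds to one row of the proposed table, with the count as the "request counts" column.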
https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream_top_referrers
We were asked by Partnerships to re-evaluate the priority of this task. cc @odimitrijevic
I just realized I personally finished and merged some of this work with Isaac and Baha earlier this year:
presto:wmf> desc referrer_daily;
Column | Type | Extra | Comment |
country | varchar | | Reader country per IP geolocation |
lang | varchar | | Wikipedia language -- e.g., en for English |
browser_family | varchar | | Browser family from user-agent |
os_family | varchar | | OS family from user-agent |
search_engine | varchar | | One of ~20 standard search engines (e.g., Google) |
num_referrals | integer | | Number of pageviews from the referral source |
year | integer | partition key | Unpadded year of request |
month | integer | partition key | Unpadded month of request |
day | integer | partition key | Unpadded day of request |
This table is being updated daily and has recent data. The task that implemented it was T270140. But that's also not marked done or in Kanban. Worth discussing.
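For anyone wanting to try the table, a small Python sketch of building a per-day query against it. The column and table names come from the `desc` output above; the helper name `referrals_by_engine_query` and the example date are made up for illustration. The one detail worth noting is that the partition keys are unpadded integers, so the filter uses `month = 3`, not `'03'`.

```python
from datetime import date


def referrals_by_engine_query(day):
    """Build a Presto query summing num_referrals per search engine for one day."""
    return (
        "SELECT search_engine, SUM(num_referrals) AS referrals\n"
        "FROM referrer_daily\n"
        # Partition keys are integers without zero padding.
        f"WHERE year = {day.year} AND month = {day.month} AND day = {day.day}\n"
        "GROUP BY search_engine\n"
        "ORDER BY referrals DESC"
    )


print(referrals_by_engine_query(date(2021, 3, 7)))
```

The resulting string can be pasted at the `presto:wmf>` prompt or passed to whatever Presto client is in use.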
The current list of search engines:
GOOGLE("Google", "google\\.", ""),
GOOGLE_TRANSLATE("Google Translate", "translate\\.googleusercontent\\.", "prev=search|client=srp"),
YAHOO("Yahoo", "search\\.yahoo\\.", ""),
BING("Bing", "\\.bing\\.", ""),
YANDEX("Yandex", "yandex\\.", ""),
BAIDU("Baidu", "\\.baidu\\.", ""),
DDG("DuckDuckGo", "duckduckgo\\.", ""),
ECOSIA("Ecosia", "\\.ecosia\\.", ""),
STARTPAGE("Startpage", "\\.(startpage|ixquick)\\.", ""),
NAVER("Naver", "search\\.naver\\.", ""),
DOCOMO("Docomo", "\\.docomo\\.", ""),
QWANT("Qwant", "qwant\\.", ""),
DAUM("Daum", "search\\.daum\\.", ""),
MYWAY("MyWay", "search\\.myway\\.", ""),
SEZNAM("Seznam", "\\.seznam\\.", ""),
AU("AU", "search\\.auone\\.", ""),
ASK("Ask", "\\.ask\\.", ""),
LILO("Lilo", "\\.lilo\\.", ""),
COC_COC("Coc Coc", "coccoc\\.", ""),
AOL("AOL", "search\\.aol\\.", ""),
RAKUTEN("Rakuten", "\\.rakuten\\.", ""),
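As a rough illustration of how entries like these could be applied, here is a Python sketch that classifies a referrer URL using a subset of the patterns above. The matching semantics are assumptions, not taken from the refinery code: the second field is treated as a regex searched anywhere in the hostname, the third as an optional regex that must also be found in the query string, and the first matching entry wins. The function name `classify_search_engine` is hypothetical.

```python
import re
from urllib.parse import urlsplit

# (label, host pattern, query pattern), copied from a subset of the list above.
SEARCH_ENGINES = [
    ("Google", r"google\.", ""),
    ("Google Translate", r"translate\.googleusercontent\.", r"prev=search|client=srp"),
    ("Yahoo", r"search\.yahoo\.", ""),
    ("Bing", r"\.bing\.", ""),
    ("DuckDuckGo", r"duckduckgo\.", ""),
]


def classify_search_engine(referer_url):
    """Return the label of the first matching search engine, or None."""
    try:
        parts = urlsplit(referer_url)
    except ValueError:
        return None
    host = parts.hostname or ""
    for label, host_pattern, query_pattern in SEARCH_ENGINES:
        if not re.search(host_pattern, host):
            continue
        # Assumed semantics: a non-empty third field must also match the query string.
        if query_pattern and not re.search(query_pattern, parts.query):
            continue
        return label
    return None
```

Under these assumptions, patterns with a leading `\\.` (such as the Bing one) only match when the hostname has a subdomain, so `www.bing.com` matches but a bare `bing.com` would not.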
Side note: the excluded countries are hardcoded here: https://gerrit.wikimedia.org/r/c/analytics/refinery/+/655804/14/oozie/referrer/daily/referrer.hql#36. Perhaps we should revisit this and use the centralized exclude list instead: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Geoeditors/Public#Country_Protection_List