
Add referrer to mediarequests dataset to inform about project
Closed, ResolvedPublic

Description

There's an important caveat with the mediacounts data. Originally we had a rough design of the media request metrics (wikitech) in which, much like the page view metrics, they would have project as a dimension. However, the mediacounts dataset has no idea which project is using the requested file. A Commons image could be requested from Wikipedia or Wikispecies, but it might also not be used on any wiki at all and instead be hotlinked from a non-wiki page.

So, as it is, the mediacounts data can't be directly split by project. However, the URIs in the dataset contain information about which wiki the file was uploaded to. According to the definition of the mediacounts dataset, these are all the possible prefixes of an upload.wikimedia.org URI path:

/math/*
/score/*
/wikibooks/{language}/*
/wikinews/{language}/*
/wikimedia/{language}/*
/wikipedia/{language}/*
/wikiquote/{language}/*
/wikisource/{language}/*
/wikiversity/{language}/*
/wikivoyage/{language}/*
/wiktionary/{language}/*
/favicon.ico/*

Where {language} can be a valid ISO language code as used in wiki URLs (en, be, ca) or commons.

Examples:

/wikipedia/bar/timeline/79799ec4287e24767726404d81fc8897.png
/wikipedia/commons/0/00/Ancient_Egypt_map-el.png
/score/l/4/l42r5igkbg59dqectsgpc0807oh7of3/l42r5igk.png
/math/e/9/4/e94049b807364f202efe747fd69f247e.png

So we could extend GetMediaFilePropertiesUDF again to add information about the project/namespace.
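As a rough illustration of the prefix parsing described above, here is a minimal Python sketch. It is not the GetMediaFilePropertiesUDF code; the names `UploadOrigin` and `parse_upload_origin` are hypothetical, and the family list is copied from the prefixes in this description. The second segment is returned as-is, since it may be a language code or a name such as commons. Paths that match neither shape (including /favicon.ico) are treated as unclassified here, which is an assumption.

```python
from typing import NamedTuple, Optional


class UploadOrigin(NamedTuple):
    """Hypothetical result type: where a file was uploaded, per its URI path."""
    family: str                 # e.g. "wikipedia", "math", "score"
    subproject: Optional[str]   # language code or project name; None for /math and /score


# Prefixes from the task description that are followed by a {language} segment.
FAMILIES_WITH_LANGUAGE = {
    "wikibooks", "wikinews", "wikimedia", "wikipedia", "wikiquote",
    "wikisource", "wikiversity", "wikivoyage", "wiktionary",
}
# Prefixes from the task description with no {language} segment.
FAMILIES_WITHOUT_LANGUAGE = {"math", "score"}


def parse_upload_origin(path: str) -> Optional[UploadOrigin]:
    """Return the upload family and subproject for an upload.wikimedia.org path."""
    segments = [s for s in path.split("/") if s]
    if not segments:
        return None
    family = segments[0]
    if family in FAMILIES_WITHOUT_LANGUAGE:
        return UploadOrigin(family, None)
    if family in FAMILIES_WITH_LANGUAGE and len(segments) >= 2:
        return UploadOrigin(family, segments[1])
    # Anything else (assumption: this includes /favicon.ico) is left unclassified.
    return None
```

For example, `parse_upload_origin("/wikipedia/commons/0/00/Ancient_Egypt_map-el.png")` yields family `wikipedia` and subproject `commons`. This only says where the file was uploaded, not which project requested it.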

Event Timeline

It should not be hard to parse the referer and find out which project the browser is referred from. Knowing where the image was uploaded is probably good for something but in no way a replacement for the normal project field.
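To make the referer idea concrete, here is a minimal Python sketch of mapping a referer string to a project. It is an illustration only: the function name `project_from_referer`, the `language.family` return format, and the handling of mobile hosts are all assumptions, and single-domain projects such as mediawiki.org are not covered.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumption: project families whose hosts look like {subdomain}.{family}.org.
PROJECT_FAMILIES = {
    "wikibooks", "wikinews", "wikimedia", "wikipedia", "wikiquote",
    "wikisource", "wikiversity", "wikivoyage", "wiktionary",
}


def project_from_referer(referer: Optional[str]) -> Optional[str]:
    """Return e.g. "en.wikipedia" for a wiki referer, or None for anything else."""
    if not referer:
        return None
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        # urlsplit raises ValueError on some malformed inputs (e.g. bad IPv6 brackets).
        return None
    if host is None:
        return None
    labels = host.lower().split(".")
    # Assumption: treat mobile hosts (en.m.wikipedia.org) as the same project.
    if len(labels) == 4 and labels[1] == "m":
        del labels[1]
    if len(labels) != 3 or labels[2] != "org" or labels[1] not in PROJECT_FAMILIES:
        # External site, missing referer, or a host shape this sketch does not handle.
        return None
    return f"{labels[0]}.{labels[1]}"
```

With this sketch, a referer of `https://en.m.wikipedia.org/wiki/Ancient_Egypt` maps to `en.wikipedia`, while a hotlink from an external page maps to `None`.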

@Tgr agreed, we can definitely get the project from the referer, but the main value of this dataset is its historical aspect. We can't backfill project using referer further than the last 90 days (as it is purged with webrequest), so the period between 2014 and three months ago would still be without a project dimension.

People will just have to live with that IMO. Which project a file was uploaded to is certainly useful information, but using it as the project field seems pretty misleading.

/wikipedia/{language}/*

Note that this segment can also be the name of a multilingual or language-less project, not just a language. commons, meta, mediawiki, foundation and test are the multilingual projects I can think of off the top of my head that have their own uploads.

/wikimedia/{language}/*

Is that ever used?

Change 523903 had a related patch set uploaded (by Fdans; owner: Fdans):
[analytics/refinery/source@master] Add UDF to get wiki project from referer string

https://gerrit.wikimedia.org/r/523903

Milimetric moved this task from Incoming to Analytics Query Service on the Analytics board.

Change 523903 merged by Nuria:
[analytics/refinery/source@master] Add UDF to get wiki project from referer string

https://gerrit.wikimedia.org/r/523903

Nuria renamed this task from "The mediacounts dataset doesn't have a project dimension" to "Add referrer to mediarequests dataset to inform about project". Sep 6 2019, 3:59 PM
Nuria removed subscribers: Ian_Furst, Doc_James, WMDE-leszek and 11 others.

Ping @fdans, this ticket can probably be moved to "done" in the canvas, no?

mforns claimed this task.
mforns added a project: Analytics-Kanban.