Strip out a www. prefix for the "project" parameter passed into the pageview API
Closed, Resolved · Public · 3 Estimated Story Points

Description

I just wrote a nice little graph that can be easily pasted everywhere. It works fine on all the language Wikipedias, but it fails on mediawiki.org because it uses {{SERVERNAME}} by default, which resolves to www.mediawiki.org, whereas the pageviews API uses a strange mediawiki.org string instead. Please allow the full server name for all wikis. Thanks!

Event Timeline

Yurik raised the priority of this task from to Needs Triage.
Yurik updated the task description. (Show Details)
Yurik added projects: Analytics, Pageviews-API.
Yurik added subscribers: Yurik, Milimetric.
Milimetric renamed this task from Use proper domain names for pageviews API to Accept a www. prefix for mediawiki and wikidata. Feb 18 2016, 6:11 PM
Milimetric moved this task from Incoming to Event Platform on the Analytics board.
Milimetric moved this task from Event Platform to Analytics Query Service on the Analytics board.
Milimetric triaged this task as Medium priority. Feb 18 2016, 6:26 PM
Milimetric moved this task from Analytics Query Service to Event Platform on the Analytics board.

@Milimetric - I think this is not exactly about the "www" prefix, but rather about always allowing canonical domains in addition to whatever shorter versions we have. When used from wiki markup, I can easily get the canonical domain, but it is hard to do string manipulations :)

@Yurik, the reason for changing the scope of the ticket is this:

  • This is how we clean the very dirty hostname we get into a "project" name [1]
  • We use that project to load data into Cassandra, and this is not something we can change: there are terabytes of data loaded already
  • So for better or worse, we have just "en.wikipedia" or "mediawiki" in Cassandra
  • We currently remove ".org" if it's passed in to map to the aforementioned convention we used

Therefore, the only sensible thing we can do is also remove "www." if the domain is prefixed with it. I guess we should do that generally, not just if mediawiki or wikidata are passed in. Hope that makes sense.

[1] https://github.com/wikimedia/analytics-refinery-source/blob/48ca37a0e618e5ea3595921fae9b37e802d641d2/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L329
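To make the proposal concrete, here is a minimal sketch of the normalization described above: drop a leading "www." and a trailing ".org" so that a canonical server name maps to the short project name stored in Cassandra. The function name `normalize_project` is hypothetical and this is not the actual AQS code; it only illustrates the intended mapping under the conventions listed in this comment.

```python
def normalize_project(project: str) -> str:
    """Map a host-like string to the short project name.

    Hypothetical helper, not the real AQS implementation. It assumes the
    stored convention described above ("en.wikipedia", "mediawiki").
    """
    name = project
    # Strip the "www." prefix generally, not only for mediawiki or wikidata.
    if name.startswith("www."):
        name = name[len("www."):]
    # Strip a trailing ".org", which is already removed today.
    if name.endswith(".org"):
        name = name[:-len(".org")]
    return name


if __name__ == "__main__":
    for host in ("www.mediawiki.org", "en.wikipedia.org", "mediawiki"):
        print(host, "->", normalize_project(host))
```

With this mapping, the {{SERVERNAME}} value www.mediawiki.org and the short form mediawiki would both resolve to the same stored project.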

Milimetric renamed this task from Accept a www. prefix for mediawiki and wikidata to Strip out a www. prefix for the "project" parameter passed into the pageview API. Feb 18 2016, 9:41 PM

We are currently looking into rewriting /api/rest_v1/ in RESTBase itself (related to T127370), and stripping the www prefix from the host / domain could be something we can do in the same vein. @Pchelolo has started to create a request filter / middleware concept along the lines of T127132, which should let us do such API-specific mangling without hacks in HyperSwitch.

Mmm, but this would be in the {project} parameter passed to AQS, not in the domain. This could apply to the AQS config that we'll put on each wiki later, but we still have to handle the problem in the AQS backend.

Change 275681 had a related patch set uploaded (by Milimetric):
Strip out www. in front of project names

https://gerrit.wikimedia.org/r/275681

Milimetric set the point value for this task to 3. Mar 10 2016, 5:17 PM

Change 275681 merged by Milimetric:
Strip out www. in front of project names

https://gerrit.wikimedia.org/r/275681