
Top Articles ad-hoc Report for Wikipedia Zero [5 pts]
Closed, ResolvedPublic


Write a Hive query on pageview data for the month of May, covering the following Wikipedia language editions:

Assamese (as)
Bengali (bn)
Bhojpuri (bh)
Bishnupriya Manipuri (bpy)
English (en)
Gujarati (gu)
Hindi (hi)
Kannada (kn)
Kashmiri (ks)
Malayalam (ml)
Marathi (mr)
Nepali (ne)
Newari (new)
Odia (or)
Pali (pi)
Punjabi (pa)
Sanskrit (sa)
Sindhi (sd)
Tamil (ta)
Telugu (te)
Urdu (ur)
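For reference, each language code above maps to a pageview "project" value of the form `<code>.wikipedia`, which is the form used in the job's filter. A small sketch of that mapping (the object name is illustrative):

```scala
// Illustrative: build the 21 project strings from the requested language codes
object Projects {
  val codes = Seq("as", "bn", "bh", "bpy", "en", "gu", "hi", "kn", "ks", "ml",
    "mr", "ne", "new", "or", "pi", "pa", "sa", "sd", "ta", "te", "ur")

  // "as" -> "as.wikipedia", "bn" -> "bn.wikipedia", etc.
  val projects: Seq[String] = codes.map(c => s"$c.wikipedia")

  def main(args: Array[String]): Unit =
    // Prints the quoted, comma-separated list usable in a SQL "in (...)" clause
    println(projects.mkString("'", "', '", "'"))
}
```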

Output should be something like:


Or maybe just one output file per language, so that consumers could mix and match.

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description. (Show Details)
Milimetric added a subscriber: Milimetric.
Restricted Application added a subscriber: Aklapper. May 14 2015, 3:24 PM
Milimetric updated the task description. (Show Details)May 14 2015, 11:19 PM
Milimetric set Security to None.
kevinator updated the task description. (Show Details)Jun 1 2015, 4:31 PM
kevinator renamed this task from Top Articles ad-hoc Report to Top Articles ad-hoc Report for Wikipedia Zero [5 pts].Jun 1 2015, 4:33 PM
kevinator triaged this task as Normal priority.
kevinator moved this task from Next Up to Tasked_Hidden on the Analytics-Kanban board.
ggellerman moved this task from Tasked_Hidden to In Progress on the Analytics-Kanban board.

Submitted a Spark shell job; it should be done in a few hours. I'm parking the code here since it was an ad-hoc job:

// Gets top 100 pageviews for the month of May, for Indic-centric language wikipedias and English wikipedia
val card: Int = 100;

// Comparator for sortWith: sorts descending by pageview count
def sortBySecond(p1: (String, Long), p2: (String, Long)): Boolean = {
  p1._2 - p2._2 > 0
}

// Fold one element into a running top-`card` vector
def topMap(top: Vector[(String, Long)], elem: (String, Long)): Vector[(String, Long)] = {
  (top :+ elem).sortWith(sortBySecond).take(card)
}

// Merge two partial top-`card` vectors
def topRed(t1: Vector[(String, Long)], t2: Vector[(String, Long)]): Vector[(String, Long)] = {
  (t1 ++ t2).sortWith(sortBySecond).take(card)
}

// Reshape a row into (project, (page_title, count)) for keyed aggregation
def tuplify(r: org.apache.spark.sql.Row): (String, (String, Long)) = {
  (r.getString(0), (r.getString(1), r.getLong(2)))
}

val df = sqlContext.parquetFile("/user/joal/pageviews/hourly2")
val sample = df.filter("year = 2015 and month = 5")
val filtered = sample.filter("agent_type = 'user' and project in ('as.wikipedia', 'bn.wikipedia', 'bh.wikipedia', 'bpy.wikipedia', 'en.wikipedia', 'gu.wikipedia', 'hi.wikipedia', 'kn.wikipedia', 'ks.wikipedia', 'ml.wikipedia', 'mr.wikipedia', 'ne.wikipedia', 'new.wikipedia', 'or.wikipedia', 'pi.wikipedia', 'pa.wikipedia', 'sa.wikipedia', 'sd.wikipedia', 'ta.wikipedia', 'te.wikipedia', 'ur.wikipedia')")
val counts = filtered.groupBy($"project", $"page_title").agg($"project", $"page_title", sum($"count") as "count")
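The helper functions above (tuplify, topMap, topRed) suggest the final step was a per-project top-100 aggregation (e.g. via aggregateByKey on the tuplified counts). A minimal, Spark-free sketch of the same top-k fold/merge logic on plain Scala collections, so it can be checked in isolation (the object name, sample data, and card = 3 cutoff are illustrative):

```scala
// Standalone sketch of the top-k logic used in the job, without Spark
object TopK {
  val card: Int = 3 // illustrative cutoff; the actual job used 100

  // true when p1 should sort before p2 (descending by count)
  def sortBySecond(p1: (String, Long), p2: (String, Long)): Boolean =
    p1._2 > p2._2

  // Fold one element into a running top-`card` vector
  def topMap(top: Vector[(String, Long)], elem: (String, Long)): Vector[(String, Long)] =
    (top :+ elem).sortWith(sortBySecond).take(card)

  // Merge two partial top-`card` vectors, as from two partitions
  def topRed(t1: Vector[(String, Long)], t2: Vector[(String, Long)]): Vector[(String, Long)] =
    (t1 ++ t2).sortWith(sortBySecond).take(card)

  def main(args: Array[String]): Unit = {
    val pages = Seq(("A", 10L), ("B", 50L), ("C", 5L), ("D", 40L), ("E", 30L))
    // Split the data to mimic two partitions, fold each, then merge
    val (left, right) = pages.splitAt(2)
    val top = topRed(
      left.foldLeft(Vector.empty[(String, Long)])(topMap),
      right.foldLeft(Vector.empty[(String, Long)])(topMap))
    println(top.map(_._1).mkString(",")) // B,D,E
  }
}
```

In Spark this pair of functions fits aggregateByKey directly: topMap is the per-partition fold and topRed the cross-partition merge, which keeps at most `card` rows in memory per key instead of sorting each project's full count list.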

I will add the results as soon as they're ready.

@Milimetric, can you update the description of the task and specify the name of the table where the data will end up and how frequently it is updated?

This data will not be updated; it's a one-time run and the only output is a zip file:

This zip contains one file for each of the 21 requested wikipedias, with the 100 top articles in that wikipedia for the month of May.

Yay, it finished, yay :)

kevinator added a comment.EditedJun 8 2015, 6:32 PM

I had a quick look at the data and compared it to stats.grok.se. For example:

Source                  May pageviews for Avengers: Age of Ultron
stats.grok.se           2,339,031
Report attached above   5,501,953

In general, it seems the pageview counts in your report are twice as big as the numbers reported on stats.grok.se. I wonder why there is such a big discrepancy. Can it all be attributed to a different pageview definition?

We haven't vetted the numbers at all; we're just getting what's in the hourly aggregate. It's possible something's wrong, but it's also possible stats.grok.se hasn't been updated recently enough to include all the available data. It would be hard to compare this way; we should check against a different source.

OK. I only checked 5 articles, but the difference seems pretty consistent.

kevinator closed this task as Resolved.Jun 11 2015, 3:37 PM