
Top Articles ad-hoc Report for Wikipedia Zero [5 pts]
Closed, ResolvedPublic


Write a Hive query on pageview data for the month of May, covering the following Wikipedia language editions:

Assamese (as)
Bengali (bn)
Bhojpuri (bh)
Bishnupriya Manipuri (bpy)
English (en)
Gujarati (gu)
Hindi (hi)
Kannada (kn)
Kashmiri (ks)
Malayalam (ml)
Marathi (mr)
Nepali (ne)
Newari (new)
Odia (or)
Pali (pi)
Punjabi (pa)
Sanskrit (sa)
Sindhi (sd)
Tamil (ta)
Telugu (te)
Urdu (ur)
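For reference, each language code above maps to a pageview "project" value of the form `<code>.wikipedia`, which is the form used in the job's filter. A small sketch of that mapping (the object name is illustrative):

```scala
// Illustrative: build the 21 project strings from the requested language codes
object Projects {
  val codes = Seq("as", "bn", "bh", "bpy", "en", "gu", "hi", "kn", "ks", "ml",
    "mr", "ne", "new", "or", "pi", "pa", "sa", "sd", "ta", "te", "ur")

  // "as" -> "as.wikipedia", "bn" -> "bn.wikipedia", etc.
  val projects: Seq[String] = codes.map(c => s"$c.wikipedia")

  def main(args: Array[String]): Unit =
    // Prints the quoted, comma-separated list usable in a SQL "in (...)" clause
    println(projects.mkString("'", "', '", "'"))
}
```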

Output should be something like:


Or maybe just one output file per language, so that consumers could mix and match.

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description. (Show Details)
Milimetric added a subscriber: Milimetric.
Restricted Application added a subscriber: Aklapper. May 14 2015, 3:24 PM
Milimetric updated the task description. (Show Details)May 14 2015, 11:19 PM
Milimetric set Security to None.
kevinator updated the task description. (Show Details)Jun 1 2015, 4:31 PM
kevinator renamed this task from Top Articles ad-hoc Report to Top Articles ad-hoc Report for Wikipedia Zero [5 pts].Jun 1 2015, 4:33 PM
kevinator triaged this task as Normal priority.
kevinator moved this task from Next Up to Tasked_Hidden on the Analytics-Kanban board.
ggellerman moved this task from Tasked_Hidden to In Progress on the Analytics-Kanban board.

Submitted a Spark shell job; it should be done in a few hours. I'm parking the code here since it was an ad-hoc job:

// Gets top 100 pageviews for the month of May, for Indic-centric language wikipedias and English wikipedia
val card: Int = 100;

// Comparator for sortWith: sorts descending by pageview count
def sortBySecond(p1: (String, Long), p2: (String, Long)): Boolean = {
  p1._2 - p2._2 > 0
}

// Fold one element into a running top-`card` vector
def topMap(top: Vector[(String, Long)], elem: (String, Long)): Vector[(String, Long)] = {
  (top :+ elem).sortWith(sortBySecond).take(card)
}

// Merge two partial top-`card` vectors
def topRed(t1: Vector[(String, Long)], t2: Vector[(String, Long)]): Vector[(String, Long)] = {
  (t1 ++ t2).sortWith(sortBySecond).take(card)
}

// Reshape a row into (project, (page_title, count)) for keyed aggregation
def tuplify(r: org.apache.spark.sql.Row): (String, (String, Long)) = {
  (r.getString(0), (r.getString(1), r.getLong(2)))
}

val df = sqlContext.parquetFile("/user/joal/pageviews/hourly2")
val sample = df.filter("year = 2015 and month = 5")
val filtered = sample.filter("agent_type = 'user' and project in ('as.wikipedia', 'bn.wikipedia', 'bh.wikipedia', 'bpy.wikipedia', 'en.wikipedia', 'gu.wikipedia', 'hi.wikipedia', 'kn.wikipedia', 'ks.wikipedia', 'ml.wikipedia', 'mr.wikipedia', 'ne.wikipedia', 'new.wikipedia', 'or.wikipedia', 'pi.wikipedia', 'pa.wikipedia', 'sa.wikipedia', 'sd.wikipedia', 'ta.wikipedia', 'te.wikipedia', 'ur.wikipedia')")
val counts = filtered.groupBy($"project", $"page_title").agg($"project", $"page_title", sum($"count") as "count")
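The helper functions above (tuplify, topMap, topRed) suggest the final step was a per-project top-100 aggregation (e.g. via aggregateByKey on the tuplified counts). A minimal, Spark-free sketch of the same top-k fold/merge logic on plain Scala collections, so it can be checked in isolation (the object name, sample data, and card = 3 cutoff are illustrative):

```scala
// Standalone sketch of the top-k logic used in the job, without Spark
object TopK {
  val card: Int = 3 // illustrative cutoff; the actual job used 100

  // true when p1 should sort before p2 (descending by count)
  def sortBySecond(p1: (String, Long), p2: (String, Long)): Boolean =
    p1._2 > p2._2

  // Fold one element into a running top-`card` vector
  def topMap(top: Vector[(String, Long)], elem: (String, Long)): Vector[(String, Long)] =
    (top :+ elem).sortWith(sortBySecond).take(card)

  // Merge two partial top-`card` vectors, as from two partitions
  def topRed(t1: Vector[(String, Long)], t2: Vector[(String, Long)]): Vector[(String, Long)] =
    (t1 ++ t2).sortWith(sortBySecond).take(card)

  def main(args: Array[String]): Unit = {
    val pages = Seq(("A", 10L), ("B", 50L), ("C", 5L), ("D", 40L), ("E", 30L))
    // Split the data to mimic two partitions, fold each, then merge
    val (left, right) = pages.splitAt(2)
    val top = topRed(
      left.foldLeft(Vector.empty[(String, Long)])(topMap),
      right.foldLeft(Vector.empty[(String, Long)])(topMap))
    println(top.map(_._1).mkString(",")) // B,D,E
  }
}
```

In Spark this pair of functions fits aggregateByKey directly: topMap is the per-partition fold and topRed the cross-partition merge, which keeps at most `card` rows in memory per key instead of sorting each project's full count list.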

I will add the results as soon as they're ready.

@Milimetric, can you update the description of the task and specify the name of the table where the data will end up and how frequently it is updated?

This data will not be updated; it's a one-time run and the only output is a zip file:

This zip contains one file for each of the 21 requested wikipedias, with the 100 top articles in that wikipedia for the month of May.

Yay, it finished, yay :)

kevinator added a comment.EditedJun 8 2015, 6:32 PM

I had a quick look at the data and compared it to stats.grok.se. For example:

Source                  May pageviews for Avengers: Age of Ultron
stats.grok.se           2,339,031
Report attached above   5,501,953

In general, it seems the pageview counts in your report are twice as big as the numbers reported on stats.grok.se. I wonder why there is such a big discrepancy. Can it all be attributed to a different pageview definition?

We haven't vetted the numbers at all; we're just getting what's in the hourly aggregate. It's possible something's wrong, but it's also possible stats.grok.se hasn't been updated recently enough to include all the available data. It would be hard to compare this way; we should check against a different source.

OK. I only checked 5 articles, but the difference seems pretty consistent.

kevinator closed this task as Resolved.Jun 11 2015, 3:37 PM