
Top Articles ad-hoc Report for Wikipedia Zero [5 pts]
Closed, ResolvedPublic


Write a Hive query on pageview data for the month of May, covering the following Wikipedia language editions:

Assamese (as)
Bengali (bn)
Bhojpuri (bh)
Bishnupriya Manipuri (bpy)
English (en)
Gujarati (gu)
Hindi (hi)
Kannada (kn)
Kashmiri (ks)
Malayalam (ml)
Marathi (mr)
Nepali (ne)
Newari (new)
Odia (or)
Pali (pi)
Punjabi (pa)
Sanskrit (sa)
Sindhi (sd)
Tamil (ta)
Telugu (te)
Urdu (ur)
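For reference, each language code above maps to a pageview "project" value of the form `<code>.wikipedia`, which is the form used in the job's filter. A small sketch of that mapping (the object name is illustrative):

```scala
// Illustrative: build the 21 project strings from the requested language codes
object Projects {
  val codes = Seq("as", "bn", "bh", "bpy", "en", "gu", "hi", "kn", "ks", "ml",
    "mr", "ne", "new", "or", "pi", "pa", "sa", "sd", "ta", "te", "ur")

  // "as" -> "as.wikipedia", "bn" -> "bn.wikipedia", etc.
  val projects: Seq[String] = codes.map(c => s"$c.wikipedia")

  def main(args: Array[String]): Unit =
    // Prints the quoted, comma-separated list usable in a SQL "in (...)" clause
    println(projects.mkString("'", "', '", "'"))
}
```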

Output should be something like:


Or maybe just one output file per language, so that consumers could mix and match.

Event Timeline

Milimetric raised the priority of this task from to Needs Triage.
Milimetric updated the task description. (Show Details)
Milimetric added a subscriber: Milimetric.
Restricted Application added a subscriber: Aklapper. May 14 2015, 3:24 PM
Milimetric updated the task description. (Show Details)May 14 2015, 11:19 PM
Milimetric set Security to None.
kevinator updated the task description. (Show Details)Jun 1 2015, 4:31 PM
kevinator renamed this task from Top Articles ad-hoc Report to Top Articles ad-hoc Report for Wikipedia Zero [5 pts].Jun 1 2015, 4:33 PM
kevinator triaged this task as Normal priority.
kevinator moved this task from Next Up to Tasked_Hidden on the Analytics-Kanban board.
ggellerman moved this task from Tasked_Hidden to In Progress on the Analytics-Kanban board.

Submitted a Spark shell job; it should be done in a few hours. I'm parking the code here since it was an ad-hoc job:

// Gets top 100 pageviews for the month of May, for Indic-centric language wikipedias and English wikipedia
val card: Int = 100;

// Comparator for sortWith: sorts descending by pageview count
def sortBySecond(p1: (String, Long), p2: (String, Long)): Boolean = {
  p1._2 - p2._2 > 0
}

// Fold one element into a running top-`card` vector
def topMap(top: Vector[(String, Long)], elem: (String, Long)): Vector[(String, Long)] = {
  (top :+ elem).sortWith(sortBySecond).take(card)
}

// Merge two partial top-`card` vectors
def topRed(t1: Vector[(String, Long)], t2: Vector[(String, Long)]): Vector[(String, Long)] = {
  (t1 ++ t2).sortWith(sortBySecond).take(card)
}

// Reshape a row into (project, (page_title, count)) for keyed aggregation
def tuplify(r: org.apache.spark.sql.Row): (String, (String, Long)) = {
  (r.getString(0), (r.getString(1), r.getLong(2)))
}

val df = sqlContext.parquetFile("/user/joal/pageviews/hourly2")
val sample = df.filter("year = 2015 and month = 5")
val filtered = sample.filter("agent_type = 'user' and project in ('as.wikipedia', 'bn.wikipedia', 'bh.wikipedia', 'bpy.wikipedia', 'en.wikipedia', 'gu.wikipedia', 'hi.wikipedia', 'kn.wikipedia', 'ks.wikipedia', 'ml.wikipedia', 'mr.wikipedia', 'ne.wikipedia', 'new.wikipedia', 'or.wikipedia', 'pi.wikipedia', 'pa.wikipedia', 'sa.wikipedia', 'sd.wikipedia', 'ta.wikipedia', 'te.wikipedia', 'ur.wikipedia')")
val counts = filtered.groupBy($"project", $"page_title").agg($"project", $"page_title", sum($"count") as "count")
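The helper functions above (tuplify, topMap, topRed) suggest the final step was a per-project top-100 aggregation (e.g. via aggregateByKey on the tuplified counts). A minimal, Spark-free sketch of the same top-k fold/merge logic on plain Scala collections, so it can be checked in isolation (the object name, sample data, and card = 3 cutoff are illustrative):

```scala
// Standalone sketch of the top-k logic used in the job, without Spark
object TopK {
  val card: Int = 3 // illustrative cutoff; the actual job used 100

  // true when p1 should sort before p2 (descending by count)
  def sortBySecond(p1: (String, Long), p2: (String, Long)): Boolean =
    p1._2 > p2._2

  // Fold one element into a running top-`card` vector
  def topMap(top: Vector[(String, Long)], elem: (String, Long)): Vector[(String, Long)] =
    (top :+ elem).sortWith(sortBySecond).take(card)

  // Merge two partial top-`card` vectors, as from two partitions
  def topRed(t1: Vector[(String, Long)], t2: Vector[(String, Long)]): Vector[(String, Long)] =
    (t1 ++ t2).sortWith(sortBySecond).take(card)

  def main(args: Array[String]): Unit = {
    val pages = Seq(("A", 10L), ("B", 50L), ("C", 5L), ("D", 40L), ("E", 30L))
    // Split the data to mimic two partitions, fold each, then merge
    val (left, right) = pages.splitAt(2)
    val top = topRed(
      left.foldLeft(Vector.empty[(String, Long)])(topMap),
      right.foldLeft(Vector.empty[(String, Long)])(topMap))
    println(top.map(_._1).mkString(",")) // B,D,E
  }
}
```

In Spark this pair of functions fits aggregateByKey directly: topMap is the per-partition fold and topRed the cross-partition merge, which keeps at most `card` rows in memory per key instead of sorting each project's full count list.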

I will add the results as soon as they're ready.

@Milimetric, can you update the description of the task and specify the name of the table where the data will end up and how frequently it is updated?

This data will not be updated; it's a one-time run and the only output is a zip file:

This zip contains one file for each of the 21 requested wikipedias, with the 100 top articles in that wikipedia for the month of May.

Yay, it finished, yay :)

kevinator added a comment.EditedJun 8 2015, 6:32 PM

I had a quick look at the data and compared it to stats.grok.se. For example:

Source                  May pageviews for Avengers: Age of Ultron
stats.grok.se           2,339,031
Report attached above   5,501,953

In general, it seems the pageview counts in your report are twice as big as the numbers reported on stats.grok.se. I wonder why there is such a big discrepancy. Can it all be attributed to a different pageview definition?

We haven't vetted the numbers at all; we're just getting what's in the hourly aggregate. It's possible something's wrong, but it's also possible stats.grok.se hasn't been updated recently enough to include all the available data. It would be hard to compare this way; we should check against a different source.

OK. I only checked 5 articles, but the difference seems pretty consistent.

kevinator closed this task as Resolved.Jun 11 2015, 3:37 PM