
Spike: How could we show a page listing most visited collections
Closed, Resolved · Public

Description

We would like to do a variation of T96222 but based on most visited.

Duration: 2hrs

  • What are our options for working with this data?
    • Talk to @DarTar or someone in Analytics about whether and how often this sort of data can be collected
  • Where can we publish it to? A static JSON file?
  • How can we access this information in the special page?

Event Timeline

Jdlrobson raised the priority of this task to Needs Triage.
Jdlrobson updated the task description. (Show Details)
Jdlrobson added a project: Gather.
Jdlrobson moved this task to Some day on the Gather board.
Jdlrobson subscribed.
Jdlrobson set Security to None.
Jdlrobson added a subscriber: DarTar.
  • how is the data collected?
    • Gather list URLs are in the form /wiki/Special:Gather/id/<list id>/<list label>. (id=0 is a special case and should be ignored.) All /wiki/ requests are recorded in Hadoop, but I imagine aggregating them for a given id is non-trivial (a rough sketch of that aggregation follows after this list).
    • Special:Gather is not cached so counting views directly from PHP code (e.g. via StatsD) is a possibility. Seems like a really bad idea though.
  • how is the data accessed?
    • most analytics infrastructure has lower availability than reader-facing production services; if there is an outage, popularity-based search/sort should degrade sanely. So accessing data directly from some analytics system is problematic.
    • could cache it in the DB and update periodically.
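
A minimal sketch of what that aggregation could look like, assuming the raw request paths can be exported from the Hadoop webrequest logs as plain text, one URI path per line (the export format here is an assumption, not the actual webrequest schema):

```python
import re
from collections import Counter

# Matches /wiki/Special:Gather/id/<list id>/<list label>; the label part is optional.
GATHER_PATH = re.compile(r"^/wiki/Special:Gather/id/(\d+)(?:/|$)")

def count_gather_views(paths):
    """Tally views per Gather list id from an iterable of URI paths.

    id=0 is a special case and is skipped, per the note above.
    """
    counts = Counter()
    for path in paths:
        m = GATHER_PATH.match(path)
        if m:
            list_id = int(m.group(1))
            if list_id != 0:
                counts[list_id] += 1
    return counts

# Hypothetical usage with an exported list of request paths, one per line:
# with open("webrequest_paths.txt") as f:
#     counts = count_gather_views(line.strip() for line in f)
```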

This is how I imagine this could work (holes need filling):
We generate a static file/JSON daily(?) of the top 50 most-visited collections based on access logs [1] and push it somewhere [2] as a CSV file ordered by views (a sketch of generating that file follows below). On the Gather side, we simply read this file and render the view using the CollectionList model/view.

The data would be:
<id>,<number of visits>
<id>,<number of visits>
etc..

[1] Is this possible?
[2] Where, I don't know.
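
Whatever the destination for [2] ends up being, producing the file itself looks simple. A sketch, assuming the per-id view counts from the aggregation above are available as a dict, writing the <id>,<number of visits> format described here, most viewed first:

```python
import csv
from collections import Counter

def write_top_collections(counts, path, limit=50):
    """Write the top `limit` collections as <id>,<number of visits>, ordered by views."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for list_id, views in Counter(counts).most_common(limit):
            writer.writerow([list_id, views])

# Hypothetical usage:
# write_top_collections(counts, "gather_top_collections.csv")
```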

Using files for this seems like a pain: you would have to make sure to keep them synchronized with the DB. (What if some of the top collections get hidden/deleted?) At the same time, there isn't any performance advantage to it, since we need to do DB requests to render that view anyway. And if we want caching, we can just do it at the Varnish level. It will be a lot easier to just keep everything in the DB.
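
For illustration, a sketch of that DB-based variant, using SQLite as a stand-in; the gather_list table and the gl_id/gl_view_count/gl_hidden column names are assumptions rather than the actual Gather schema. The special page would then just ORDER BY the counter, and hidden/deleted collections drop out of the existing query automatically:

```python
import sqlite3

def update_view_counts(conn, counts):
    """Periodically fold per-list view counts into the lists table.

    `counts` maps list id -> views for the last period; ids for lists that
    no longer exist simply match no rows and are ignored.
    """
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "UPDATE gather_list SET gl_view_count = gl_view_count + ? WHERE gl_id = ?",
            [(views, list_id) for list_id, views in counts.items()],
        )

# Hypothetical usage:
# conn = sqlite3.connect("gather.db")
# update_view_counts(conn, counts)
#
# The special page query would then be something along the lines of:
#   SELECT gl_id, gl_label FROM gather_list
#   WHERE gl_hidden = 0 ORDER BY gl_view_count DESC LIMIT 50;
```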

Mmm, I hadn't thought of deleted collections.
@Tgr, makes sense. Just not clear how we do [1] or [2].
@Milimetric any ideas?

So our plan for this is to make a stats endpoint on RESTBase. We talked to Gabriel and he rightly thinks we should have our own backend. So that's what we want to do.

In the meantime, I don't think the team is ready to open up the cluster for this kind of work. We've had availability problems with just the researchers running their daily work, so we couldn't guarantee any level of reliability. We have a request in for more hardware, and we're thinking that once we get it we'll be able to allow ad-hoc work.

So, sadly, the only pageview data we currently make available is on http://dumps.wikimedia.org/other/pagecounts-all-sites/ in the form of flat files :/
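
For reference, pulling the Gather lines out of one of those flat files is straightforward. A sketch, assuming the usual pagecounts line layout of whitespace-separated project, page title, view count and byte fields (worth double-checking against the actual dumps):

```python
def gather_lines(dump_file):
    """Yield (project, list id, views) for Special:Gather entries in a pagecounts file."""
    prefix = "Special:Gather/id/"
    for line in dump_file:
        fields = line.split()
        if len(fields) < 3 or not fields[1].startswith(prefix):
            continue
        # Title looks like Special:Gather/id/<list id>/<list label>
        list_id = fields[1][len(prefix):].split("/", 1)[0]
        if list_id.isdigit() and int(list_id) != 0:
            yield fields[0], int(list_id), int(fields[2])

# Hypothetical usage:
# with open("pagecounts-20150601-120000") as f:
#     for project, list_id, views in gather_lines(f):
#         ...
```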

@Milimetric is there a card for the stats endpoint? It looks like we're blocked until then, then :) Thanks for this information!

There's no card for it; it's a stretch goal for this quarter, but EventLogging problems are really holding us back. I'd normally offer to help with some ad-hoc solution, but honestly I don't think we have the bandwidth. If you guys feel this is critical, we're happy to help one of you start working on the stats endpoint early. RESTBase endpoints are pretty easy to write, so the hard part is really standing up the back end. But if you want to start chipping away at it, let's talk.

No problem. Stretch goal our end too. Thanks for the discussion @Tgr and @Milimetric :)

Jdlrobson claimed this task.
Jdlrobson added a subscriber: JKatzWMF.

@JKatzWMF any questions?

@Jdlrobson @Tgr -- I see. I think I know the answer to this, but is there any reason you couldn't schedule flat-file dumps, pull just the Gather pages, and add a tally to the collection table? Or is that a shit-ton of work? Sounds like it might be.

If so, I think we have our answer for now ;)

Seems doable. An hourly dump file is about 500 MB; grep processes it in a couple of seconds.
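
Putting the pieces together, the job @JKatzWMF describes could be a daily cron along these lines; the file naming, gzip compression and field layout are assumptions about the dump format, and the result would then be pushed into the collection table (or the CSV) as sketched earlier:

```python
import glob
import gzip
from collections import Counter

def daily_gather_tally(dump_glob):
    """Tally Gather list views across a day's worth of hourly pagecounts dumps."""
    counts = Counter()
    for path in sorted(glob.glob(dump_glob)):
        with gzip.open(path, "rt", errors="replace") as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3 and fields[1].startswith("Special:Gather/id/"):
                    list_id = fields[1].split("/")[2]
                    if list_id.isdigit() and int(list_id) != 0:
                        counts[int(list_id)] += int(fields[2])
    return counts

# Hypothetical usage over one day's hourly files:
# counts = daily_gather_tally("pagecounts-20150601-*.gz")
```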