
Spike: How could we show a page listing most visited collections
Closed, Resolved · Public

Description

We would like to do a variation of T96222 but based on most visited.

Duration: 2hrs

  • What are our options for working with this data?
    • Talk to @DarTar or someone in Analytics about whether and how often this sort of data can be collected
  • Where can we publish it to? A static JSON file?
  • How can we access this information in the special page?

Event Timeline

Jdlrobson raised the priority of this task to Needs Triage.
Jdlrobson updated the task description. (Show Details)
Jdlrobson added a project: Gather.
Jdlrobson moved this task to Some day on the Gather board.
Jdlrobson subscribed.
Jdlrobson set Security to None.
Jdlrobson added a subscriber: DarTar.
  • how is the data collected?
    • Gather list URLs are in the form /wiki/Special:Gather/id/<list id>/<list label>. (id=0 is a special case and should be ignored.) All /wiki/ requests are recorded in Hadoop, but I imagine aggregating them for a given id is non-trivial (a rough sketch of that aggregation follows after this list).
    • Special:Gather is not cached so counting views directly from PHP code (e.g. via StatsD) is a possibility. Seems like a really bad idea though.
  • how is the data accessed?
    • most analytics infrastructure has lower availability than reader-facing production services; if there is an outage, popularity-based search/sort should degrade sanely. So accessing data directly from some analytics system is problematic.
    • could cache it in the DB and update periodically.
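
A minimal sketch of what that aggregation could look like, assuming the raw request paths can be exported from the Hadoop webrequest logs as plain text, one URI path per line (the export format here is an assumption, not the actual webrequest schema):

```python
import re
from collections import Counter

# Matches /wiki/Special:Gather/id/<list id>/<list label>; the label part is optional.
GATHER_PATH = re.compile(r"^/wiki/Special:Gather/id/(\d+)(?:/|$)")

def count_gather_views(paths):
    """Tally views per Gather list id from an iterable of URI paths.

    id=0 is a special case and is skipped, per the note above.
    """
    counts = Counter()
    for path in paths:
        m = GATHER_PATH.match(path)
        if m:
            list_id = int(m.group(1))
            if list_id != 0:
                counts[list_id] += 1
    return counts

# Hypothetical usage with an exported list of request paths, one per line:
# with open("webrequest_paths.txt") as f:
#     counts = count_gather_views(line.strip() for line in f)
```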

This is how I imagine this could work (holes need filling):
We generate a static file/JSON daily(?) of the top 50 most-visited collections based on access logs [1] and push it somewhere [2] as a CSV file ordered by views (a sketch of generating that file follows below). On the Gather side, we simply read this file and render the view using the CollectionList model/view.

The data would be:
<id>,<number of visits>
<id>,<number of visits>
etc..

[1] Is this possible?
[2] Where, I don't know.
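
Whatever the destination for [2] ends up being, producing the file itself looks simple. A sketch, assuming the per-id view counts from the aggregation above are available as a dict, writing the <id>,<number of visits> format described here, most viewed first:

```python
import csv
from collections import Counter

def write_top_collections(counts, path, limit=50):
    """Write the top `limit` collections as <id>,<number of visits>, ordered by views."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for list_id, views in Counter(counts).most_common(limit):
            writer.writerow([list_id, views])

# Hypothetical usage:
# write_top_collections(counts, "gather_top_collections.csv")
```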

Using files for this seems like a pain: you would have to make sure to keep them synchronized with the DB. (What if some of the top collections get hidden/deleted?) At the same time, there isn't any performance advantage to it, since we need to do DB requests to render that view anyway. And if we want caching, we can just do it at the Varnish level. It will be a lot easier to just keep everything in the DB.
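
For illustration, a sketch of that DB-based variant, using SQLite as a stand-in; the gather_list table and the gl_id/gl_view_count/gl_hidden column names are assumptions rather than the actual Gather schema. The special page would then just ORDER BY the counter, and hidden/deleted collections drop out of the existing query automatically:

```python
import sqlite3

def update_view_counts(conn, counts):
    """Periodically fold per-list view counts into the lists table.

    `counts` maps list id -> views for the last period; ids for lists that
    no longer exist simply match no rows and are ignored.
    """
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "UPDATE gather_list SET gl_view_count = gl_view_count + ? WHERE gl_id = ?",
            [(views, list_id) for list_id, views in counts.items()],
        )

# Hypothetical usage:
# conn = sqlite3.connect("gather.db")
# update_view_counts(conn, counts)
#
# The special page query would then be something along the lines of:
#   SELECT gl_id, gl_label FROM gather_list
#   WHERE gl_hidden = 0 ORDER BY gl_view_count DESC LIMIT 50;
```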

Mmm, I hadn't thought of deleted collections.
@Tgr, makes sense. Just not clear how we do [1] or [2].
@Milimetric any ideas?

So our plan for this is to make a stats endpoint on RESTBase. We talked to Gabriel and he rightly thinks we should have our own backend. So that's what we want to do.

In the meantime, I don't think the team is ready to open up the cluster for this kind of work. We've had availability problems with just the researchers running their daily work, so we couldn't guarantee any level of reliability. We have a request in for more hardware, and we're thinking that once we get it we'll be able to allow ad-hoc work.

So, sadly, the only pageview data we currently make available is on http://dumps.wikimedia.org/other/pagecounts-all-sites/ in the form of flat files :/
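
For reference, pulling the Gather lines out of one of those flat files is straightforward. A sketch, assuming the usual pagecounts line layout of whitespace-separated project, page title, view count and byte fields (worth double-checking against the actual dumps):

```python
def gather_lines(dump_file):
    """Yield (project, list id, views) for Special:Gather entries in a pagecounts file."""
    prefix = "Special:Gather/id/"
    for line in dump_file:
        fields = line.split()
        if len(fields) < 3 or not fields[1].startswith(prefix):
            continue
        # Title looks like Special:Gather/id/<list id>/<list label>
        list_id = fields[1][len(prefix):].split("/", 1)[0]
        if list_id.isdigit() and int(list_id) != 0:
            yield fields[0], int(list_id), int(fields[2])

# Hypothetical usage:
# with open("pagecounts-20150601-120000") as f:
#     for project, list_id, views in gather_lines(f):
#         ...
```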

@Milimetric is there a card for the stats endpoint? It looks like we're blocked until then, then :) Thanks for this information!

There's no card for it; it's a stretch goal for this quarter, but EventLogging problems are really holding us back. I'd normally offer to help with some ad-hoc solution, but honestly I don't think we have the bandwidth. If you guys feel this is critical, we're happy to help one of you start working on the stats endpoint early. RESTBase endpoints are pretty easy to write, so the hard part is really standing up the back end. But if you want to start chipping away at it, let's talk.

No problem. Stretch goal our end too. Thanks for the discussion @Tgr and @Milimetric :)

Jdlrobson claimed this task.
Jdlrobson added a subscriber: JKatzWMF.

@JKatzWMF any questions?

@Jdlrobson @Tgr -- I see. I think I know the answer to this, but is there any reason you couldn't schedule flat-file dumps, pull just the Gather pages, and add a tally to the collection table? Or is that a shit-ton of work? Sounds like it might be.

If so, I think we have our answer for now ;)

Seems doable. An hourly dump file is about 500 MB; grep processes it in a couple of seconds.
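
Putting the pieces together, the job @JKatzWMF describes could be a daily cron along these lines; the file naming, gzip compression and field layout are assumptions about the dump format, and the result would then be pushed into the collection table (or the CSV) as sketched earlier:

```python
import glob
import gzip
from collections import Counter

def daily_gather_tally(dump_glob):
    """Tally Gather list views across a day's worth of hourly pagecounts dumps."""
    counts = Counter()
    for path in sorted(glob.glob(dump_glob)):
        with gzip.open(path, "rt", errors="replace") as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3 and fields[1].startswith("Special:Gather/id/"):
                    list_id = fields[1].split("/")[2]
                    if list_id.isdigit() and int(list_id) != 0:
                        counts[int(list_id)] += int(fields[2])
    return counts

# Hypothetical usage over one day's hourly files:
# counts = daily_gather_tally("pagecounts-20150601-*.gz")
```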