Investigate lowering "per-article" resolution data in AQS
Closed, Declined (Public)

Description

Investigate lowering "per-article" resolution data in AQS.

It is not clear that storing pageviews at daily per-article resolution in PageviewAPI delivers value; there seem to be very few stakeholders for this data. We need to investigate whether we can load data into Cassandra with a TTL, so that it expires after "x-amount-of-time" and is no longer available.

Implementing a strategy for the resolution of pageview data in Cassandra will be a goal for the team in Q3.
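
As a rough illustration of the TTL idea, the sketch below builds a CQL `INSERT ... USING TTL` statement from a retention period in days. The table name, column names, and the 90-day retention are hypothetical placeholders, not the actual AQS schema or an agreed value; it only shows how "x-amount-of-time" could translate into the TTL (in seconds) that Cassandra expects.

```python
from __future__ import annotations

SECONDS_PER_DAY = 86_400


def ttl_seconds(retention_days: int) -> int:
    """Convert a retention period in days to a Cassandra TTL in seconds."""
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return retention_days * SECONDS_PER_DAY


def build_insert_with_ttl(table: str, columns: list[str], retention_days: int) -> str:
    """Build a parameterised CQL INSERT whose rows expire after retention_days."""
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    return (
        f"INSERT INTO {table} ({column_list}) "
        f"VALUES ({placeholders}) USING TTL {ttl_seconds(retention_days)}"
    )


# Hypothetical table and columns, used only for illustration.
statement = build_insert_with_ttl(
    "pageviews_per_article_daily",
    ["project", "article", "day", "views"],
    retention_days=90,
)
print(statement)
```

Rows written with such a statement would stop being readable once the TTL elapses, so the retention decision would be made at load time rather than by a separate cleanup job.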

Event Timeline

Nuria triaged this task as Low priority. Mar 13 2017, 4:59 PM
Milimetric raised the priority of this task from Low to Needs Triage. Apr 2 2018, 3:52 PM
Milimetric moved this task from Wikistats to Deprioritized on the Analytics board.

There are several cases where daily article resolution for pageviews could make sense, or even hourly or per-minute resolution. This is not so much for the usual article as for special marker articles. Such statistics only (?) make sense if they can be given with a geolocation.

Examples

  • daily pageviews to follow outbreaks of diseases
  • hourly pageviews to follow tsunamis
  • per-minute resolution to track meteors and earthquakes

The previous examples are tracked by other, more accurate systems, but there are perhaps examples of events that have no good tracking system.

I would propose a system that merges at least three pageviews in time and space into a time-space cube satisfying some constraints, and keeps the cube at a resolution that makes it a hard problem to separate the individual events. It is possible to reformulate the problem slightly by aggregating information in a kind of statistical voxel, and only keeping a vector representation.
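
A minimal sketch of the voxel idea, under my own simplifying assumptions: each pageview is reduced to a latitude, longitude, and timestamp, voxels are plain grid cells (degrees by degrees by seconds), and only a count is kept per voxel rather than a richer vector. The names `View`, `voxel_key`, and `aggregate`, and the default cell sizes, are hypothetical. Voxels with fewer than three views are dropped, so no single event is exposed.

```python
import math
from collections import Counter
from typing import Iterable, NamedTuple


class View(NamedTuple):
    lat: float      # degrees
    lon: float      # degrees
    epoch_s: float  # seconds since the epoch


def voxel_key(view: View, cell_deg: float, bucket_s: float) -> tuple[int, int, int]:
    """Map a view to the integer index of its space-time grid cell."""
    return (
        math.floor(view.lat / cell_deg),
        math.floor(view.lon / cell_deg),
        math.floor(view.epoch_s / bucket_s),
    )


def aggregate(
    views: Iterable[View],
    cell_deg: float = 1.0,
    bucket_s: float = 3600.0,
    min_count: int = 3,
) -> dict[tuple[int, int, int], int]:
    """Count views per voxel and keep only voxels with at least min_count views."""
    counts = Counter(voxel_key(v, cell_deg, bucket_s) for v in views)
    return {key: n for key, n in counts.items() if n >= min_count}


views = [
    View(59.9, 10.7, 100),
    View(59.5, 10.2, 200),
    View(59.1, 10.9, 3000),
    View(60.2, 10.7, 100),   # alone in its cell, so it is dropped
    View(59.9, 10.7, 4000),  # alone in its time bucket, so it is dropped
]
print(aggregate(views))
```

The cell size and time bucket are the knobs that trade resolution against how hard it is to separate individual events; a real design would need to choose them per use case.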

It has some similarities to the tracking algorithms used in radars, but there can be multiple, quite dissimilar patterns and not just point objects. An outbreak of a disease could have a filled circular pattern starting at airports; a tsunami could form circular line patterns intersecting a coastline; a meteor could form a linearly moving filled circle.

The point is: low resolution may hamper using pageviews to investigate phenomena.

Another pretty kewl thing to do is to calculate which articles are trending in the morning. Because the whole pageview-mix is pretty noisy, you must first try to create a model for how the mix of articles changes through the day and week, and then try to figure out whether the observed change in the morning is for real or just ordinary noise.

And yes, I have already tried, and it works. You can even see the journalists starting to work on a case in the morning (05-06), checking background in Wikipedia, then the drop for the editorial meeting (07-08), and people starting to read the stories (09-10) and double-checking in Wikipedia.
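
One simple way to sketch the "real change or ordinary noise" test is a z-score against a per-slot baseline. This is an assumed stand-in, not necessarily the model used in the experiment described above: `history` is taken to be an article's share of all pageviews in the same hour-of-week slot over previous weeks, and the threshold of 3 standard deviations is an arbitrary illustrative choice.

```python
from statistics import mean, stdev
from typing import Sequence


def trending_score(history: Sequence[float], observed: float) -> float:
    """Z-score of the observed share against the same slot in earlier weeks.

    Needs at least two historical values, otherwise statistics.stdev raises.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any rise above the baseline counts as unbounded.
        return float("inf") if observed > mu else 0.0
    return (observed - mu) / sigma


def is_trending(history: Sequence[float], observed: float, threshold: float = 3.0) -> bool:
    """Flag the article if the observed share is far above its usual level."""
    return trending_score(history, observed) > threshold


# Share of total pageviews for one article, same weekday and hour, last five weeks.
history = [0.010, 0.012, 0.011, 0.009, 0.010]
print(is_trending(history, 0.020))
print(is_trending(history, 0.011))
```

Comparing within the same hour-of-week slot is what removes the regular daily and weekly rhythm, and it only works if the stored data keeps at least hourly resolution.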