
[EPIC] As a user, I want to see a list of currently trending articles
Closed, Declined · Public

Description

Currently the apps use the pageviews API in order to show users a list of what was popular yesterday. While this is good for historical data, it would be better to provide a more timely list of popular articles.

To do this we need a more up-to-date data source to identify trending articles. @Jdlrobson has created a trending site on labs:
http://trending.wmflabs.org

It is powered by an API:
http://trending.wmflabs.org/api/trending/

This works by consuming the edits stream (rcstream) and running an algorithm to determine if an article is trending by analyzing the edits. This appears to be a viable way to get a handle on trending articles.

To use this in production we need to port this work to the mobile content service and expose it as an API.

The GET /trending/edits API should have two modes (see the request sketch after this list):

  1. Return a list of the currently trending articles in order of "trendiness" at any time.
  2. Return a list of trending articles chronologically, in the order that they became "trendy".
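
Purely as an illustration of those two modes, the requests could look something like the sketch below. The order parameter and its values are hypothetical; nothing about the request shape has been decided in this task.

GET /trending/edits                       (mode 1: ranked by current "trendiness")
GET /trending/edits?order=chronological   (mode 2: hypothetical parameter; ordered by when articles became "trendy")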

During implementation, it should be considered that this service (or some of its logic) will be used to develop a trending push notification service in order to push the most significant trending articles to users.

Timeline for deployment is 2-3 months with iOS and Android using this service to power a new section in the Explore Feed.

See meeting notes here:
https://docs.google.com/document/d/1gr_Ww3JNAMPGcjdbuYLDdlDFVvNhknoC2ZOQf7biDPM/edit#

The JSON array for each page returned should use the same common keys/values as the feed endpoint (https://en.wikipedia.org/api/rest_v1/feed/featured/2016/06/08).

These keys include: title, pageid, thumbnail, normalizedtitle, description. See the example JSON:

"pages": [
{
"title": "Kimbo_Slice",
"pageid": 17709998,
"thumbnail": {
"source": "http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/Kimbo_Slice_1.jpg/207px-Kimbo_Slice_1.jpg",
"width": 207,
"height": 320
},
"normalizedtitle": "Kimbo Slice",
"description": "American professional wrestler"
},
{
"title": "Muhammad_Ali",
"pageid": 63747,
"thumbnail": {
"source": "http://upload.wikimedia.org/wikipedia/commons/thumb/8/89/Muhammad_Ali_NYWTS.jpg/256px-Muhammad_Ali_NYWTS.jpg",
"width": 256,
"height": 320
},
"normalizedtitle": "Muhammad Ali",
"description": "African American boxer, philanthropist and activist"
}
]

Prior to deployment we need to migrate to the Change Propagation service:
https://www.mediawiki.org/wiki/Change_propagation

@mobrovac can help guide the transition from RCStream to this. It should be relatively trivial.


Event Timeline


So currently the trending API endpoint does this with a few exceptions:

  1. Description returned as it is from API

I can easily fix this though.

  2. Uses pages key rather than articles

My goal is to have a trending API for all projects. Is there any reason you use "articles" rather than "pages"? Articles makes sense for Wikipedia but not so much for other projects.

  3. No rank key

What does this mean? Is it needed?

  4. visits can be 0

Just worth accounting for. If I know how many visits a page has had I include it. If I don't it's 0.

@Jdlrobson sorry for the confusion. We are updating the description to remove extraneous keys. Some specific answers:

  1. @bearND will give you instructions on finding/using the shared code to get the description formatted correctly.
  2. Pages is good for the outer key.
  3. Ignore this
  4. That's good to know. Can you make this "views" to align with the keys above?

I changed articles to pages, and removed rank and views in the description.

The [[ https://phabricator.wikimedia.org/diffusion/GMOA/browse/master/lib/mwapi.js;a8bbc9d560af5c9464033a7308e0174a32c3f895$227 | buildSummaryResponse in mwapi.js ]] does the formatting of the summary data. It's very simple, so I'm not sure if reusing makes a lot of sense right now, unless you're already reusing part of MCS library code.

Is there any reason you use "articles" rather than "pages"? Articles makes sense for Wikipedia but not so much for other projects.

No specific reason beyond that for the existing feed endpoint we only expected to be dealing with articles from the Wikipedias. "Pages" is fine, too.

  3. No rank key

What does this mean? Is it needed?

That's the article's rank from the raw metrics pageview API output (note that there are gaps in the rankings because non-mainspace pages and some others are stripped out). I passed the rank through to the response rather than stripping it, as a kind of easy sanity check for the results and just in case we might want to use it at some point. But it's not necessary and not currently used for anything.

I'm a little confused, so please raise a ticket here:
https://github.com/jdlrobson/weekipedia/issues
clearly articulating the changes you need and I'll do it.
Sounds like changing visits to views and formatting the description, but please sanity check me by raising that ticket :)

@Jdlrobson Yes, you are correct. I've submitted https://github.com/jdlrobson/weekipedia/issues/83 and https://github.com/jdlrobson/weekipedia/issues/84 for this.

About pages vs. articles, I very much prefer pages. I'm just not sure whether it's worth changing the existing most-read endpoint.

Here is a description of the event bus topics that can be used:

https://github.com/wikimedia/mediawiki-event-schemas/blob/master/config/eventbus-topics.yaml

Looks like this is the one we are interested in. Is that right, @mobrovac?

mediawiki.revision_create:
  schema_name: mediawiki/revision_create
Jdlrobson renamed this task from "As a user, I want to see a list of currently trending articles" to "[EPIC] As a user, I want to see a list of currently trending articles". (Sep 13 2016, 6:37 PM)
Jdlrobson added a project: Trending-Service.
Jdlrobson updated the task description.

@Fjalapeno

The trending service would not be interacting with Kafka topics directly; ChangeProp would do that for you (interacting with Kafka is quite hard and painful). Here's the schema of the revision-create event:
https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/1.yaml

ChangeProp will have access to all the fields in this schema, can filter based on those fields (filters might be quite complex), and can construct arbitrary requests using those fields to call the trending service.

To give you a concrete idea, here's one event from the revision-create topic (which contains messages for events fired on edits):

{
  "comment": "",
  "database": "enwiki",
  "meta": {
    "domain": "en.wikipedia.org",
    "dt": "2016-09-14T14:49:43+00:00",
    "id": "78b59cb0-7a8a-11e6-87fa-b083fecf0303",
    "request_id": "d00045d0-15ac-4d59-8f8e-63d6bb5878c0",
    "schema_uri": "mediawiki/revision/create/1",
    "topic": "mediawiki.revision-create",
    "uri": "https://en.wikipedia.org/wiki/Katya_Santos"
  },
  "page_id": 3904643,
  "page_is_redirect": false,
  "page_namespace": 0,
  "page_title": "Katya_Santos",
  "performer": {
    "user_groups": [
      "*"
    ],
    "user_is_bot": false,
    "user_text": "<IP-address-was-here-removed-for-privacy>"
  },
  "rev_content_format": "wikitext",
  "rev_content_model": "wikitext",
  "rev_id": 739412403,
  "rev_len": 5465,
  "rev_minor_edit": false,
  "rev_parent_id": 725573218,
  "rev_sha1": "sw2rs008n1xthtdba2wqxblb9ddw7ew",
  "rev_timestamp": "2016-09-14T14:49:43Z"
}

This information can be forwarded to the service in any format you want. Also, fields can be excluded / reformatted if needed.
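
As an illustration only, here is a minimal Node sketch of what the receiving side of that forwarding could look like: ChangeProp POSTs the (possibly reformatted) revision-create event to the trending service, which filters and counts it. The /edit path, the port, and the naive counter are placeholders, not the actual service design.

var http = require('http');

// Naive in-memory "score": page title -> number of qualifying edits seen.
var scores = {};

http.createServer(function (req, res) {
  if (req.method !== 'POST' || req.url !== '/edit') { // hypothetical route
    res.statusCode = 404;
    return res.end();
  }
  var body = '';
  req.on('data', function (chunk) { body += chunk; });
  req.on('end', function () {
    var event = JSON.parse(body);
    // Only count main-namespace, non-bot edits, using fields from the revision-create schema above.
    if (event.page_namespace === 0 && !(event.performer && event.performer.user_is_bot)) {
      scores[event.page_title] = (scores[event.page_title] || 0) + 1;
    }
    res.statusCode = 202;
    res.end();
  });
}).listen(6927); // placeholder port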

Hi, um, I have written a couple of realtime trending pageview prototypes in the past that have mostly worked. If this is something that y'all want, I think it would be REALLY cool. Doing this based on edits is cool, and the pageview API gives you trending pageviews, but not in real time. Not sure if that matters or not.

https://gerrit.wikimedia.org/r/#/c/225485/

There is also an (almost) working Flink one somewhere...

@Fjalapeno, @Jdlrobson, Is using revision creates (edits) really the best way of determining trending articles? Realtime windowed trending pageviews is very possible. E.g. https://flink.apache.org/news/2015/12/04/Introducing-windows.html

Not trying to block anything, but I think we should consider how this is often done in other places. Stream processing is something we've wanted to provide for a while, but we have not prioritized it because there hasn't been a lot of demand.

@Ottomata, if that existed, then yes, that would be useful.

From our experiments editing works pretty well - https://pushipedia.wmflabs.org/trending - (especially if we're able to combine with ORES) - but it's not seen as the perfect solution - more an incremental step in the right direction that's achievable with the tools we have right now.

Anecdotally, it takes time for Wikipedia pages to show up in search results (when things have trended on http://pushipedia.wmflabs.org I've searched on Google for them and not found the Wikipedia article in the results), so I'm not sure what impact pageviews would have.

I've been using the daily page view data to score articles in http://trending.wmflabs.org with limited results, but any other signals we can make use of would be awesome.

Aye, cool.

The daily pageview data is computed batch style, and loaded directly from Hadoop into Cassandra to make it available via the Pageview API. The logic that knows how to classify incoming webrequests as pageviews is generic though, and can run in any generic Stream Processing system (like Spark Streaming, Flink, etc.). The PoC that I've played with consumes the streams of webrequest from Kafka, and computes a list of windowed trending article views (say, top 100 over the last minute, or 10 minutes, or whatever), emitted every 10 seconds or so.
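
To make the windowing idea concrete, here is a minimal Node sketch of a sliding-window top-N counter over an event stream. It is a toy, in-memory stand-in for what Flink or Spark Streaming would do; the window length, emit interval, and event shape are all illustrative.

var WINDOW_MS = 10 * 60 * 1000; // look at the last 10 minutes
var EMIT_MS = 10 * 1000;        // publish a ranking every 10 seconds

var recent = []; // events within the window, oldest first; e.g. { title: 'Kimbo_Slice', ts: 1473864583000 }

// Call this for every incoming pageview (or edit) event.
function onEvent(event) {
  recent.push(event);
}

setInterval(function () {
  var cutoff = Date.now() - WINDOW_MS;
  // Drop events that have fallen out of the window.
  while (recent.length && recent[0].ts < cutoff) {
    recent.shift();
  }
  // Count occurrences per title within the window and take the top 100.
  var counts = {};
  recent.forEach(function (e) {
    counts[e.title] = (counts[e.title] || 0) + 1;
  });
  var top = Object.keys(counts)
    .sort(function (a, b) { return counts[b] - counts[a]; })
    .slice(0, 100);
  console.log('Top of the last 10 minutes:', top.slice(0, 5));
}, EMIT_MS);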

This works now on the Analytics Hadoop cluster, but if we were to productionize it in such a way that user-facing features would rely on such a stream, we'd want to run a standalone stream processing cluster. We'd like to set one of these up, maybe within this FY. If this existed, could you make use of it?

I just want to suggest that stream processing is understood and considered for things like this, even if only as "wouldn't it be great if we could do it that way!?"

Could you surface such an API for labs? I'd happily play with it to see how it fits into the big picture of trending, and then if it works well we will have good justification for productionising! :)

I want to second @Ottomata's suggestions and concerns. Analyzing edit data will give us "popular edits", which is, no doubt, useful, but those are not "readership trends", which can only be computed in real time over the pageview stream. I am not so sure that popular edits are actually a proxy for trending pageviews in all instances.

This link needs to be proven before we go out to the world with an API of Wikipedia's trending articles. Either that, or let's name the API "popular edits" to truly represent its data source.

I've been using the daily page view data to score articles in http://trending.wmflabs.org with limited results, but any other signals we can make use of would be awesome.

The main issue with using daily pageview data for this purpose is that it represents "popular articles" (with some bot traffic that we know is getting through now) but not "trending" articles. You will miss any short-term spike, which is the whole point of trending.

I want to second @Ottomata's suggestions and concerns. Analyzing edit data will give us "popular edits", which is, no doubt, useful, but those are not "readership trends", which can only be computed in real time over the pageview stream. I am not so sure that popular edits are actually a proxy for trending pageviews in all instances.

Speaking from experience of using the feature since January, I don't agree with you: using editing data alone has been pretty accurate in getting relevant information to me and my guinea pigs. I've learned about the deaths of Alan Rickman and David Bowie, the earthquakes in India, and shootings within minutes of the story breaking on BBC News. I just learned about this: http://m.bbc.com/sport/football/37483344
Whether this would be picked up by a reading spike I'm not sure - maybe we could look at this particular example and see when a spike occurs on the Sam Allardyce article.

I'm still not sure how important page view trends are alone, but as I've already said, I think when combined with edits they will give the best signal.

Page views alone are probably not enough. My feeling is that if a significant number of people are reading about something, then it's no longer trending, in that people already know about it.

If there is a spike in an article due to a popular post on Huffington Post going viral, that's not necessarily that interesting, unless there's also been a spike in editing to that page, which signals there has been an event. A good trending service will balance all these things, and although we won't get it 100% right with just editing, we will get closer.

Speaking from experience of using the feature since January, I don't agree with you: using editing data alone has been pretty accurate in getting relevant information to me and my guinea pigs. I've learned about the deaths of Alan Rickman and David Bowie, the earthquakes in India, and shootings within minutes of the story breaking on BBC News. I just learned about this: http://m.bbc.com/sport/football/37483344

I do not dispute that; my point is more about what you are missing rather than what you are learning about. I think using edits has great value, but by itself it cannot give you trending pageviews, so the name "trending" is misleading. This data is "popular (good) edits", which, again, I agree is valuable data.

Page views alone are probably not enough. My feeling is that if a significant number of people are reading about something, then it's no longer trending, in that people already know about it.

That is why a trending calculation is different from a "top" computation. A "top" computation gives you stuff that you already know. We know "David Bowie" is going to be among the top pages the day after his death, for example. "Trending data" is calculated on a shorter timespan (it can even be done geographically) and thus allows for discovery.

Page views alone are probably not enough.

For readership, yes, it should be enough. Now, a pageview trending stream can also be augmented with edit data.

I am not against us doing this work, but I am not in favor of calling this an API about "trending pages" when the methodology of calculation excludes (by definition) a bunch of trending signals.

I agree we are missing out on something. See https://phabricator.wikimedia.org/T140102#2656761 - this service is being built based on the existing constraints we have due to the gaps in our ecosystem.

I guess where we differ is that I'm thinking about the problem from the opposite direction. I see editing as the strongest signal and real-time page views as the second strongest. I am already using the daily pageviews to filter out things that should not trend.

What problems do you see with calling an API (at least in terms of endpoint/repo) trending if the idea is to iterate on it and in the future include other signals such as live page views?

What problems do you see with calling an API (at least in terms of endpoint/repo) trending if the idea is to iterate on it and in the future include other signals such as live page views?

I think we just differ on naming. In my opinion we cannot reinvent the definition of "trending", which is about readership, not editing. If we are surfacing data about popular edits, let's call it that: "popular edits". I think anyone would equate a trending API of Wikipedia pages with "most-read pages in our trending interval" (say, an hour), but this API will not provide such data. cc @DarTar, as we were talking about this recently.

"Popular edits" is also misleading. It is going to be doing lots of filtering based on things like how biased edits are, edit quality, and spam detection, and it's also going to be consulting the most-read pages from the previous day.

I'm happy calling it the "magic monkey API" or something neutral while we are developing this, if this is the only problem with this service. The long-term goal is a fully fledged trending service, but that's not going to happen straight away.

In my opinion we cannot reinvent the definition of "trending", which is about readership, not editing.

Why is "trending" about readership?

I get the desire to make the name informative, so that if we develop other trending systems, implementors can be clear about what the API is providing. Although as long as the docs are clear, I +1 the magic monkey suggestion.

But I don't think "trending" is necessarily about readership, at least not in any broadly used way. Trending is a black box on most platforms, often involving other engagement signals (shares, comments, retweets, etc etc) and I don't think users give much thought to the precise choice of signals involved.

http://instagram-engineering.tumblr.com/post/122961624217/trending-at-instagram is a great read.
"Intuitively, a trending hashtag should be one that is being used more than usual, as a result of something specific that is happening in that moment."

We would do well to define what trending means to us and use that as a guide. The definition should be agnostic to how it's implemented under the hood and focus on what it does.
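
One simple way to make such a definition concrete (illustrative only, not a proposed algorithm): score an item by how far its recent activity exceeds its usual baseline, regardless of whether the underlying signal is edits, views, or a blend.

// "Trending" as activity relative to a historical baseline; a score well above 1
// means the item is being edited/viewed noticeably more than usual.
function trendScore(recentCount, baselinePerHour, hoursObserved) {
  var expected = baselinePerHour * hoursObserved;
  // +1 keeps brand-new items with no history from dividing by zero.
  return recentCount / (expected + 1);
}

// e.g. an article normally edited ~0.2 times/hour that received 12 edits in the last hour:
console.log(trendScore(12, 0.2, 1)); // 10 -> strongly "trending" by this naive measure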

We can call it interesting if this seems like a less loaded term.

I agree with @JMinor that trending is not just about readership.

For context: the reason why I brought this up is not just because bikeshedding on names is fun :) If down the line we plan to work on designing a proper trend extraction algorithm (e.g. via burst/anomaly detection), we need to make sure we don't have a production-level API endpoint or end user functionality with the same name, where in fact what they capture is "most/highly edited" or "most/highly viewed" articles. I would favor descriptive names too and I think the above suggestions – while less fancy than "trending" – might be more accurate and avoid name collisions.

I like this discussion!

Could you surface such an API for labs? I'd happily play with it to see how it fits into the big picture of trending, and then if it works well we will have good justification for productionising! :)

It'd be difficult to build something useful in labs, since streaming pageview data doesn't exist there, or publicly. I'm at the ops offsite now, and things we are discussing are making me more optimistic about setting up a production stream processing service in the not too long term; MAYBE this FY, maybe a bit longer. Because we don't have this now, I'm not suggesting you try to incorporate pageviews into the trending algorithm at this moment. I just want it to be considered as something useful and possible.

Just curious, are you using windows to compute trending? Just googled around for a nodejs based 'stream processing' lib, found this, and its README at least makes it sound cool.

Trending is a black box on most platforms, often involving other engagement signals (shares, comments, retweets, etc etc) and I don't think users give much thought to the precise choice of signals involved.

Public metrics about trending Wikimedia things would be SUPER cool, so this is very exciting. If the intention of this project is to produce trending metrics, starting with trending edits, this sounds good. I'd even be fine with calling this 'trending articles' IFF it was documented how this metric is calculated, and how it may change in the future (e.g. pageviews may be used as input to the algorithm).

We want to avoid the situation we had in the past with the old pageviews (AKA pagecounts) metric. WMF has a history of haphazardly defining metrics and then releasing them to the public without properly documenting the definition of those metrics and when and how they change over time. The Analytics and Research team went through a big collaboration to strictly define what a 'Pageview' is, and also a process for making changes to this definition.

If we are going to release a product metric called 'trending articles', we should do the same and be as explicit and strict about what that metric is as we can.

Regarding the name bike-shedding: we can always mark the endpoint as experimental to signal that it will change/evolve over time, so the exact name shouldn't be a real issue here at this point.

What needs to be agreed upon, it seems to me, is whether this service will enrich its selection process over time or not. In other words, do we need/want two different services/APIs (interesting and trending)? I could see them eventually converging into a single entity; a trending entity is one that is shared, talked about and edited, IMHO. Hence, I am not entirely sure we need two separate definitions - one for reading, the other for editing.

I don't see anybody looking for "trending articles" as a product metric. Its most obvious value is in surfacing trends to users, and I think it makes a lot of sense to iterate on the inputs & their weights to maximize this value, as proposed by @Jdlrobson.

@Jdlrobson this was brought up as a potentially better way to consume the stream:

T130651

@GWicke is looking into getting this exposed internally so you can use it.

@Jdlrobson also from the services sync meeting:

If we can limit the playback to just an hour or 2 for the first version, we can get this deployed much more easily.

You may need to sync with @mobrovac on this to make sure we are on the same page.

Attaching general notes from the meeting:

  • An hour or 2 of playback only in the first version (limited computation)
  • Meeting with Ops after the timeline is decided
  • Potentially use Redis
  • Public event streams (https://phabricator.wikimedia.org/T130651): looks like this is going to be deployed internally soon. Talk to Andrew Otto about using this internally before the big public deploy. It allows pulling data back in, in case of a restart of the trending service.

FYI I've started talking to @ovasileva about moving this along, so hopefully in the next 2-4 weeks you'll start seeing increased velocity on this project.

Repo has been requested on https://www.mediawiki.org/wiki/Git/New_repositories/Requests - not sure if anyone is able to approve that :)

2 thoughts:

  1. Since you are building a production service that has access to internal infrastructure, I wouldn't recommend using public EventStreams over just a good ol' Kafka consumer client.
  2. http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/ is a really great article on stream processing and application state. I am not suggesting that we use Kafka Streams (a Java framework) to build this, but the concepts are relevant.

2 thoughts:

  1. Since you are building a production service that has access to internal infrastructure, I wouldn't recommend using public EventStreams over just a good ol' Kafka consumer client.

An upside of this would be higher efficiency for high volume processing. Edit events are fairly low volume though, so this probably doesn't matter too much.

A downside is the need for more manual handling of authentication, multi-DC topic merging, error handling, and kafka upgrades. Using a stable public eventstreams API avoids those issues, and decouples the service from the backend infrastructure. It also simplifies the development somewhat by enabling testing against regular public APIs.

2 thoughts:

  1. Since you are building a production service that has access to internal infrastructure, I wouldn't recommend using public EventStreams over just a good ol' Kafka consumer client.

Can you clarify what you mean by "internal infrastructure"? The current version is scoped to consuming the public pageviews API and public edits.

Since this first version might fail, I'm keen not to over-engineer at this point and to do whatever's simplest. T145553 is scheduled to be worked on soon and is probably the best place to discuss this further.

Can you clarify what you mean by "internal infrastructure"?

Public EventStreams will be a public-facing service from which anyone can consume. It is a dumbed-down Kafka consumer proxy, with fewer features and less resiliency.

Gabriel and I disagree on the advantages of using Public EventStreams vs. a Kafka client. In his list of disadvantages, the only one that I can sympathize with is 'kafka upgrades', although in the past this has been fairly seamless. Kafka has done a good job making things backwards compatible.

I guess all I mean is: Kafka > a custom-built HTTP proxy. If you can use Kafka directly (you can, because you are internal), I don't see why you wouldn't want to.

I guess all I mean is: Kafka > a custom-built HTTP proxy. If you can use Kafka directly (you can, because you are internal), I don't see why you wouldn't want to.

The trade-offs are those between using a published, versioned API vs. directly querying the underlying storage.

Apart from cases with special (performance) requirements, we have been generally moving away from directly querying databases in the interest of decreasing implementation coupling & code duplication, and increasing API stability.

If we were exposing a 'database', where the API call will finish when the database returns a result, I might agree. Streams are different, and a Kafka client gets you more resiliency and redundancy than this HTTP proxy. Jon will have to write more custom client logic to deal with offsets and auto-resumes (and consumer balancing, if needed) to work with EventStreams. A Kafka client has all that built in.

Jon will have to write more custom client logic to deal with offsets and auto-resumes (and consumer balancing, if needed) to work with EventStreams.

This surprises me. Why would a consumer of an HTTP API need to worry about details like balancing? The API will expose a single logical topic, merged between DCs. Regarding reconnects: The SSE clients I have seen implement the regular spec reconnection behavior, and once we implement time-based resumption in the API (not trivial with Kafka 0.9, but easier with 0.10), then there isn't even a need to store anything on the client for the "catch up" use case after restarts.
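
For comparison, consuming such a public SSE endpoint from Node would look roughly like the sketch below. The stream URL is hypothetical (the public service did not yet exist at this point), and the 'eventsource' npm package is just one possible client.

var EventSource = require('eventsource');

// Hypothetical stream URL for revision-create events.
var es = new EventSource('https://stream.wikimedia.org/v2/stream/revision-create');

es.onmessage = function (msg) {
  var event = JSON.parse(msg.data);
  console.log('Edit to', event.page_title, 'on', event.meta.domain);
};

es.onerror = function (err) {
  // The SSE spec's built-in reconnection covers short drops; longer outages would
  // need the time-based resumption discussed above.
  console.error('Stream error:', err);
};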

If the service consuming the stream restarts, how will it know where it left off? Where are the offsets (or timestamps) stored, and what is the code that will look that up?

Kafka clients have this feature built in, even if the consumer process re-spawns on a different host.

The code has not been written yet but in an ideal world:

  • Trending service gets notification/event about an edit.
  • Trending service processes this event and updates a local variable describing the current state of the world.
  • Trending service stores "notable events" internally for future reference by apis.

When the stream restarts:

  • Trending restores the most recent known state OR replays the last hour of edits (if all we care about is real-time updates)

I should note that if we miss 5 minutes of events it's not the end of the world for the first service, as by nature it's meant to be ephemeral (apps are only going to show recent "trend" events if available).

Personally, from a development point of view, I'd love to avoid replaying edits at all if possible and just store a JSON blob with a timestamp.

I don't really care what we use, but I'd rather not have to worry about OAuth authentication in the first version, even if the trade-off is a little accuracy, given the goal is a minimum viable product to get user feedback on and evaluate, not something we'll have to maintain for years to come.

Just tell us where and how to get these edits from since RCStream doesn't seem to be suitable to use.
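
A minimal sketch of the "JSON blob with a timestamp" idea mentioned above; the file path, interval, and state shape are placeholders.

var fs = require('fs');

var SNAPSHOT_FILE = '/tmp/trending-state.json'; // placeholder path
var state = { updatedAt: null, trending: [] };

// Periodically persist the in-memory trending state.
function saveSnapshot() {
  state.updatedAt = new Date().toISOString();
  fs.writeFileSync(SNAPSHOT_FILE, JSON.stringify(state));
}

// On startup, restore the last snapshot if one exists; otherwise start empty and
// let fresh events (or an hour of replayed edits) rebuild the state.
function restoreSnapshot() {
  try {
    state = JSON.parse(fs.readFileSync(SNAPSHOT_FILE, 'utf8'));
  } catch (e) {
    // no snapshot, or a corrupt one: keep the empty default
  }
}

restoreSnapshot();
setInterval(saveSnapshot, 60 * 1000); // snapshot once a minute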

I'd rather not have to worry about OAuth authentication on the first version

We don't have any authentication on Kafka yet. At the moment your service can just consume. Proper auth will come in the future, but we aren't yet sure when. So, for first version, you wouldn't have to deal with it.

I'd love to avoid replaying edits at all if possible and just store a JSON blob with a timestamp.

If you use a Kafka client, and set the auto commit interval to N seconds (this is on by default, every 5 seconds), you don't have to worry about storing anything. When you restart, the Kafka client will read its last committed offsets from Kafka, and start from there.

Just tell us where and how to get these edits from since RCStream doesn't seem to be suitable to use.

In either case, it will be the [[ https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/revision/create/1.yaml | mediawiki.revision-create ]] events.

This will be coded in nodejs, right? Example consumer usage would be:

var kafka = require('node-rdkafka');

var topics = [
    'eqiad.mediawiki.revision-create',
    'codfw.mediawiki.revision-create'
];

var consumer = new kafka.KafkaConsumer({
    'group.id': 'trending-edits-service', // or whatever
    'metadata.broker.list': 'localhost:9092', // comma separated list of kafka brokers
});

var revisionStream = consumer.getReadStream(topics);

revisionStream.on('data', function(kafkaMessage) {
  var revision_create_event = JSON.parse(kafkaMessage.value.toString());
  console.log("Revision created: ", revision_create_event);  
});

Other examples (and consume modes) are here. (Note: My example is based on changes in the latest master, which is not yet versioned.)
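
For reference, the auto-commit behaviour mentioned earlier corresponds to standard librdkafka configuration properties that node-rdkafka passes through. A minimal sketch (the group id and broker list are placeholders):

var kafka = require('node-rdkafka');

// With auto-commit on, offsets are committed for you, so after a restart the
// consumer resumes from its last committed position without the trending
// service having to store anything itself.
var consumer = new kafka.KafkaConsumer({
    'group.id': 'trending-edits-service',     // placeholder consumer group
    'metadata.broker.list': 'localhost:9092', // placeholder broker list
    'enable.auto.commit': true,               // on by default
    'auto.commit.interval.ms': 5000           // commit offsets every 5 seconds
});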

(Subtasks are kept there - not the epic, which is project specific)

Jhernandez subscribed.

Seems like things went down a different route in the end. Reflecting reality ☝