
Add ORES articlequality data to summaries?
Closed, Declined · Public

Description

The summary endpoint is a great way to get metadata about a page that is useful to display to users to help them decide if they would like to read an article.

ORES articlequality scores convey the quality of an article based on the current latest revision. Reading is currently evaluating how to show the ORES score to users.

Just like wikidata description and thumb, we expect this to become a basic piece of data that clients may want to display in many other contexts.

This leads to the question: Is it feasible to return ORES data in the summary of articles?

How would this affect caching / cache invalidation / storage / CPU?

This ticket is intentionally very similar to:
T157068 and T157061

Basically, the thrust of these tickets is the enrichment of summary data.

Example data to be merged:
https://ores.wmflabs.org/scores/enwiki/wp10/720618545/
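To illustrate what "merging" might look like, here is a minimal sketch of extracting summary-friendly fields from an ORES wp10 response. The sample payload below is hypothetical (the real response shape may differ between ORES API versions), and `wp10_summary_fields` is an illustrative helper, not an existing API:

```python
import json

# Hypothetical sample of an ORES wp10 response for one revision;
# the real payload shape may differ between ORES API versions.
SAMPLE = json.loads("""
{
  "720618545": {
    "prediction": "GA",
    "probability": {
      "Stub": 0.01, "Start": 0.04, "C": 0.15,
      "B": 0.25, "GA": 0.45, "FA": 0.10
    }
  }
}
""")

def wp10_summary_fields(payload: dict, rev_id: str) -> dict:
    """Extract the fields a summary endpoint might merge in:
    the predicted quality class and its probability."""
    score = payload[rev_id]
    prediction = score["prediction"]
    return {
        "articlequality": {
            "prediction": prediction,
            "probability": score["probability"][prediction],
        }
    }

print(wp10_summary_fields(SAMPLE, "720618545"))
```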

Event Timeline

Pchelolo subscribed.

It is certainly feasible to do that. The only potential problem I see is the possibility of races between the ORES precaching update and the summary query. The problem here is that on every edit, ChangeProp makes a request to ORES to warm up the cache. In parallel, it also re-renders the summary. So it is possible that the request for the summary would come in before the precaching request, and the ORES load would be duplicated.

We don't really have any way of coordinating such parallel things, have to think about this.

Hmmm, I think we might be mixing apples and oranges here. ORES computes scores for the revision diff, not for the revision content, and as such I don't think it belongs with the summary endpoint, which has no notion of revisions.

@mobrovac We're talking about a special model in ORES that evaluates the article quality as a whole, not the quality of the diff between two revisions. See https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/wp10 for more details.

> So it is possible that the request for the summary would come in before the precaching request, and the ORES load would be duplicated.

IIRC ORES uses a web worker queue to deduplicate requests, so the only thing that would double is the queue length, which is unlikely to be a big deal.

> Hmmm, I think we might be mixing apples and oranges here. ORES computes scores for the revision diff, not for the revision content, and as such I don't think it belongs with the summary endpoint, which has no notion of revisions.

In the end, everything in the summary endpoint is revision content (for the latest revision of the article). Even having the damaging score there would make a lot of sense, as it can tell you when you are about to return vandalized content in the summary.

The two questions that IMO need to be addressed for this are how to cache the data (even if ORES has its own cache, I doubt hitting it on every pageview would scale well; and in any case we are already duplicating caching architecture between Varnish and RESTBase, we really shouldn't get into triplicating it), and how to handle mass invalidation when ORES models change.

> @mobrovac We're talking about a special model in ORES that evaluates the article quality as a whole, not the quality of the diff between two revisions. See https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/wp10 for more details.

@Pchelolo @mobrovac this is correct. The wp10 score is a measure of the quality of the article itself, not of the revision.

> The two questions that IMO need to be addressed for this are how to cache the data

Summary content is stored in RESTBase and is re-rendered on every edit asynchronously.

> How to handle mass invalidation when ORES models change.

We have a concept of content-type version. When the version is bumped, all old content is disregarded and re-rendered from scratch automatically. We could use that for ORES model updates - bump a patch version of the content type.
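The version-bump mechanism can be sketched as a simple comparison. The names below are illustrative, not the actual RESTBase implementation:

```python
# Hypothetical sketch of the content-type version check: stored summaries
# rendered under an older version are disregarded and re-rendered.
CURRENT_CONTENT_VERSION = (1, 3, 0)  # bump the patch part on an ORES model update

def needs_rerender(stored_version):
    """Tuple comparison orders (major, minor, patch) lexicographically,
    so any older stored version triggers a re-render from scratch."""
    return stored_version < CURRENT_CONTENT_VERSION
```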

Massive Varnish purges are not supported currently, so the only option is waiting for it to fall out of the cache naturally. When XKey support is added I could envision adding an XKey: Summary to the response and then mass-purging all summaries with that key on model change.

From @Halfak: T135684 documents how long it took to rescore all of Wikipedia.

@Fjalapeno: Do we have information on median / p99 score times?

Not handy, but this graph should be close-ish. One big difference is that we won't need to wait for the API to respond when working from the XML dumps.

https://grafana.wikimedia.org/dashboard/db/ores?panelId=15&fullscreen ~1.15s median and ~2.5s p99.

> Not handy, but this graph should be close-ish. One big difference is that we won't need to wait for the API to respond when working from the XML dumps.
>
> https://grafana.wikimedia.org/dashboard/db/ores?panelId=15&fullscreen ~1.15s median and ~2.5s p99.

Thanks for the pointer! I think the second is the p95 value; over the last 90 days, p99 averaged 9.35s.

In any case, those numbers look reasonable to me, especially for something primarily driven by async updates. The issue of latency seems more pressing in the context of the annotated recent changes feed, but even there a delay of a couple of seconds might be tolerable.

> We could use that for ORES model updates - bump a patch version of the content type.
>
> Massive Varnish purges are not supported currently, so the only option is waiting for it to fall out of the cache naturally. When XKey support is added I could envision adding an XKey: Summary to the response and then mass-purging all summaries with that key on model change.

@Pchelolo spoke with @Halfak about this earlier and some thoughts that may help with handling model changes:

  1. The wp10 model is pretty stable at this point and not updated often/significantly
  2. When the model is updated, we can run a job to update all the data in a batch (which will take < 1 day)
  3. It's "ok" if the summaries aren't updated immediately since the values aren't likely to change significantly
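The batch job in point 2 can be sketched roughly as follows. Every function here is a hypothetical stand-in injected by the caller, not a real ORES or RESTBase API:

```python
import time

# Illustrative sketch of the batch re-scoring job described above: walk
# all pages, score each with the new model, then tell RESTBase to
# refetch the summary. `score_page` and `refetch_summary` are
# hypothetical stand-ins for the real service calls.

def rescore_all(titles, score_page, refetch_summary, delay=0.0):
    """Score each page with the new model, then trigger a summary
    re-render; throttle with `delay` to keep backend load manageable."""
    done = []
    for title in titles:
        score_page(title)       # warm the ORES cache with the new model
        refetch_summary(title)  # RESTBase pulls in the fresh score
        done.append(title)
        if delay:
            time.sleep(delay)
    return done
```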

> We have a concept of content-type version. When the version is bumped, all old content is disregarded and re-rendered from scratch automatically. We could use that for ORES model updates - bump a patch version of the content type.

Triggering cache invalidation for tens of millions of pages simultaneously seems a bit scary. Has that been tested in practice?

> Massive Varnish purges are not supported currently, so the only option is waiting for it to fall out of the cache naturally. When XKey support is added I could envision adding an XKey: Summary to the response and then mass-purging all summaries with that key on model change.

Invalidating the Varnish cache for tens of millions of pages simultaneously is definitely scary. XKey probably can't handle that amount; manual bans could, but it would crush the backend.

In any case, this could be handled by a backend job, according to Aaron (go through all pages, score them with the new model, and as a page is done tell RESTBase to refetch the data). That sounds a lot easier than doing some sort of gradual invalidation on the RESTBase side.

Worst case, if we didn't add any invalidation logic anywhere, new scores would appear within 30 days due to normal cache expiry, which is still not tragic, so I suppose this is not really a concern.

> Triggering cache invalidation for tens of millions of pages simultaneously seems a bit scary. Has that been tested in practice?

To clarify, the versioning is actually set up to avoid the need to invalidate all caches at once. Clients are expected to migrate gradually, we vary on Accept header values, and we can implement upgrades as efficiently as possible (not necessarily a re-render from scratch).

For model updates it should be a lot more efficient to update just the ORES score. We can do that on demand, triggered by clients asking for a newer (minor) content version, or with a background job doing the same. Re-doing the scoring does not look like a breaking change to the summary endpoint overall, so it doesn't look like it would require a major version bump.

If I understand correctly, an updated scoring model shouldn't require even a minor semver bump on the RESTBase endpoint, at least as long as the ORES JSON key-value structure is the same, right? I might be misunderstanding the nature of what it is we're referring to when asking for a "newer (minor) content version", though. @GWicke, would you please clarify?

@JAllemandou FYI this is one of the use-cases of applying the wp10 "article quality model" to revision texts using hadoop/spark/etc.

To wrap up the annual planning aspect of this, does everyone agree that implementing this task is relatively straightforward and does not require any hardware purchase (the possible issues in T157222/T146664 notwithstanding) or major involvement from the ORES or Services teams?

> To wrap up the annual planning aspect of this, does everyone agree that implementing this task is relatively straightforward and does not require any hardware purchase (the possible issues in T157222/T146664 notwithstanding) or major involvement from the ORES or Services teams?

On the Reading side, no. It does, however, imply a higher volume of work for ORES so that should be mentioned/stated in the linked tickets.

> If I understand correctly, an updated scoring model shouldn't require even a minor semver bump on the RESTBase endpoint, at least as long as the ORES JSON key-value structure is the same, right?

It doesn't require a version bump, but depending on how drastic the scoring changes are we might still elect to bump minor versions even without changes to the result format schema. This gives us a way to force re-renders where this makes sense.

> To wrap up the annual planning aspect of this, does everyone agree that implementing this task is relatively straightforward and does not require any hardware purchase (the possible issues in T157222/T146664 notwithstanding) or major involvement from the ORES or Services teams?

The effort needed on the services side looks very manageable, below anything we need to call out for annual planning.

I don't think ORES will have any serious capacity concerns for this. We already generate scores for every edit as it is saved. Generating the "wp10" score on top of the "damaging" score requires trivial (effectively zero) additional resources.

> I don't think ORES will have any serious capacity concerns for this. We already generate scores for every edit as it is saved. Generating the "wp10" score on top of the "damaging" score requires trivial (effectively zero) additional resources.

Awesome. There's one more issue for us, though: ORES takes the rev ID as input, but upon a request to the RESTBase summary endpoint we only know the page title, not the latest revision ID. We can obtain the latest rev ID either from Cassandra or from the MW API, but that means we can't request ORES scores in parallel with fetching the other data, effectively doubling the endpoint latency. It's not a big deal, since this happens on background updates and normal usage will be served from storage, but it's still not ideal.
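The sequential dependency can be sketched as follows. The fetchers are injected so the sequencing logic is testable without network access; the lambdas below are stand-ins for the real MW API and ORES calls:

```python
# Sketch of the two-step lookup described above: resolve a page title
# to its latest revision ID (from storage or the MW API), then ask
# ORES for that revision's wp10 score.

def summary_wp10(title, get_latest_rev_id, get_wp10_score):
    """The two calls are inherently sequential: the ORES request cannot
    start until the revision ID is known, so the endpoint's latency is
    roughly the sum of both round trips."""
    rev_id = get_latest_rev_id(title)
    return rev_id, get_wp10_score(rev_id)

rev, score = summary_wp10(
    "Zebra",
    get_latest_rev_id=lambda t: 720618545,          # stand-in for the MW API
    get_wp10_score=lambda r: {"prediction": "GA"},  # stand-in for ORES
)
```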

@Halfak What do you think: can we add an alternative way to request the wp10 score from ORES by page title instead of by revision ID?

@Pchelolo, just to clarify: you want ORES to send a request to api.php, look up the most recent rev_id for a given page_title, and then generate a score for you? I'm not sure that this would be a good thing for ORES to support.

> @Pchelolo, just to clarify: you want ORES to send a request to api.php, look up the most recent rev_id for a given page_title, and then generate a score for you? I'm not sure that this would be a good thing for ORES to support.

I'm not sure I completely understand how ORES works, but my limited understanding is that for the wp10 model it fetches the whole content of the article and runs scoring on it, not on a diff between revisions, right? If so, fetching the content from the MW API could be done by revision ID (what's done currently) or by page title (what we want here).

Again, as I've stated before, this is a minor optimization for background updates of the summary endpoint in RESTBase, so if it's a major inconvenience on the ORES side we can totally live with the current APIs and be completely fine. But if I understand the wp10 score correctly, it's a measurement of an article, not of a particular revision, so even from an API design perspective, being able to access it by article title (or page ID) seems nice and appropriate.

wp10 scores revisions; damaging etc. score diffs between subsequent revisions. (Conceptually, anyway. Technically, they still assign the score to a revision; every revision has a single parent, so there is a one-to-one relationship between revisions and diffs, absent some edge cases.) When people talk about a page/article/title, they almost always actually mean a revision (it rarely makes sense otherwise, as you could not make any assumptions about the content of the page).

ORES could assume you are talking about the latest revision and return the score for that, but that complicates the logic (you would have to verify you are getting the revision you expected, and re-query ORES if not, to avoid having inconsistent data in the summary; plus ORES would have to deal with cache invalidation, while right now scores can only be invalidated by ORES maintenance actions), and it seems like a premature micro-optimization IMO.
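The verify-and-retry logic described above can be sketched as follows. The fetchers are injected stand-ins, not real APIs; `max_retries` is an illustrative safety bound:

```python
# Sketch of the consistency check: if ORES resolved the "latest
# revision" itself, the caller would need to verify the scored revision
# matches the one it expected, and re-query on mismatch, to avoid
# leaving inconsistent data in the summary.

def consistent_score(title, get_latest_rev_id, score_latest, max_retries=3):
    """Retry until the scored revision matches the latest revision,
    so a racing edit cannot leave stale data in the summary."""
    for _ in range(max_retries):
        expected = get_latest_rev_id(title)
        rev_id, score = score_latest(title)
        if rev_id == expected:  # no edit landed between the two calls
            return rev_id, score
    raise RuntimeError("revision kept changing while scoring: " + title)
```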

Thanks for the context, @Tgr. It sounds convincing that it's better to do it in two steps in RESTBase instead of mangling the ORES API.

@Pchelolo, can we pass the revision in ChangeProp requests?

GWicke triaged this task as Medium priority. Aug 8 2017, 9:08 PM
awight renamed this task from "Add ORES WP10 data to summaries?" to "Add ORES articlequality data to summaries?". Sep 26 2018, 6:47 PM
awight updated the task description.
elukey subscribed.

We are moving to Lift Wing: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing

I am closing old tasks related to ORES since it is being deprecated; please re-open if you feel that any work could be done on Lift Wing.