
Add ORES articlequality data to summaries?
Closed, Declined · Public

Description

The summary endpoint is a great way to get metadata about a page that is useful to display to users to help them decide if they would like to read an article.

ORES articlequality scores convey the quality of an article based on the current latest revision. Reading is currently evaluating how to show the ORES score to users.

Just like wikidata description and thumb, we expect this to become a basic piece of data that clients may want to display in many other contexts.

This leads to the question: Is it feasible to return ORES data in the summary of articles?

How would this affect caching / cache invalidation / storage / CPU?

This ticket is intentionally very similar to:
T157068 and T157061

Basically, the thrust of these tickets is the enrichment of summary data.

Example data to be merged:
https://ores.wmflabs.org/scores/enwiki/wp10/720618545/
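To illustrate what "merging" might look like, here is a minimal sketch of extracting summary-friendly fields from an ORES wp10 response. The sample payload below is hypothetical (the real response shape may differ between ORES API versions), and `wp10_summary_fields` is an illustrative helper, not an existing API:

```python
import json

# Hypothetical sample of an ORES wp10 response for one revision;
# the real payload shape may differ between ORES API versions.
SAMPLE = json.loads("""
{
  "720618545": {
    "prediction": "GA",
    "probability": {
      "Stub": 0.01, "Start": 0.04, "C": 0.15,
      "B": 0.25, "GA": 0.45, "FA": 0.10
    }
  }
}
""")

def wp10_summary_fields(payload: dict, rev_id: str) -> dict:
    """Extract the fields a summary endpoint might merge in:
    the predicted quality class and its probability."""
    score = payload[rev_id]
    prediction = score["prediction"]
    return {
        "articlequality": {
            "prediction": prediction,
            "probability": score["probability"][prediction],
        }
    }

print(wp10_summary_fields(SAMPLE, "720618545"))
```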

Event Timeline

Pchelolo subscribed.

It is certainly feasible to do that. The only potential problem I see is the possibility of races between the ORES precaching update and the summary query. The problem here is that on every edit, ChangeProp makes a request to ORES to warm up the cache. In parallel, it also re-renders the summary. So it is possible that the request for the summary would come in before the precaching request, and the ORES load would be duplicated.

We don't really have any way of coordinating such parallel things, have to think about this.

Hmmm, I think we might be mixing apples and oranges here. ORES computes scores for the revision diff, not for the revision content, and as such I don't think it belongs with the summary endpoint, which has no notion of revisions.

@mobrovac We're talking about a special model in ORES that evaluates the article quality as a whole, not the quality of the diff between two revisions. See https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/wp10 for more details.

> So it is possible that the request for the summary would come in before the precaching request, and the ORES load would be duplicated.

IIRC ORES uses a web worker queue to deduplicate requests, so the only thing that would double is the queue length, which is unlikely to be a big deal.

> Hmmm, I think we might be mixing apples and oranges here. ORES computes scores for the revision diff, not for the revision content, and as such I don't think it belongs with the summary endpoint, which has no notion of revisions.

In the end, everything in the summary endpoint is revision content (for the latest revision of the article). Even having the damaging score there would make a lot of sense, as it can tell you when you are about to return vandalized content in the summary.

The two questions that IMO need to be addressed for this are how to cache the data (even if ORES has its own cache, I doubt hitting it on every pageview would scale well; and in any case we are already duplicating caching architecture between Varnish and RESTBase, we really shouldn't get into triplicating it), and how to handle mass invalidation when ORES models change.

> @mobrovac We're talking about a special model in ORES that evaluates the article quality as a whole, not the quality of the diff between two revisions. See https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service/wp10 for more details.

@Pchelolo @mobrovac this is correct. The wp10 score is a measure of the quality of the article itself, not of the revision.

> The two questions that IMO need to be addressed for this are how to cache the data

Summary content is stored in RESTBase and is re-rendered on every edit asynchronously.

> How to handle mass invalidation when ORES models change.

We have a concept of content-type version. When the version is bumped, all old content is disregarded and re-rendered from scratch automatically. We could use that for ORES model updates - bump a patch version of the content type.
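The version-bump mechanism can be sketched as a simple comparison. The names below are illustrative, not the actual RESTBase implementation:

```python
# Hypothetical sketch of the content-type version check: stored summaries
# rendered under an older version are disregarded and re-rendered.
CURRENT_CONTENT_VERSION = (1, 3, 0)  # bump the patch part on an ORES model update

def needs_rerender(stored_version):
    """Tuple comparison orders (major, minor, patch) lexicographically,
    so any older stored version triggers a re-render from scratch."""
    return stored_version < CURRENT_CONTENT_VERSION
```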

Massive Varnish purges are not supported currently, so the only option is waiting for it to fall out of the cache naturally. When XKey support is added I could envision adding an XKey: Summary to the response and then mass-purging all summaries with that key on model change.

From @Halfak: T135684 documents how long it took to rescore all of Wikipedia.

@Fjalapeno: Do we have information on median / p99 score times?

Not handy, but this graph should be close-ish. One big difference is that we won't need to wait for the API to respond when working from the XML dumps.

https://grafana.wikimedia.org/dashboard/db/ores?panelId=15&fullscreen ~1.15s median and ~2.5s p99.

> Not handy, but this graph should be close-ish. One big difference is that we won't need to wait for the API to respond when working from the XML dumps.
>
> https://grafana.wikimedia.org/dashboard/db/ores?panelId=15&fullscreen ~1.15s median and ~2.5s p99.

Thanks for the pointer! I think the second is the p95 value; over the last 90 days, p99 averaged 9.35s.

In any case, those numbers look reasonable to me, especially for something primarily driven by async updates. The issue of latency seems more pressing in the context of the annotated recent changes feed, but even there a delay of a couple of seconds might be tolerable.

> We could use that for ORES model updates - bump a patch version of the content type.
>
> Massive Varnish purges are not supported currently, so the only option is waiting for it to fall out of the cache naturally. When XKey support is added I could envision adding an XKey: Summary to the response and then mass-purging all summaries with that key on model change.

@Pchelolo spoke with @Halfak about this earlier and some thoughts that may help with handling model changes:

  1. The wp10 model is pretty stable at this point and not updated often/significantly
  2. When the model is updated, we can run a job to update all the data in a batch (which will take < 1 day)
  3. It's "ok" if the summaries aren't updated immediately since the values aren't likely to change significantly
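The batch job in point 2 can be sketched roughly as follows. Every function here is a hypothetical stand-in injected by the caller, not a real ORES or RESTBase API:

```python
import time

# Illustrative sketch of the batch re-scoring job described above: walk
# all pages, score each with the new model, then tell RESTBase to
# refetch the summary. `score_page` and `refetch_summary` are
# hypothetical stand-ins for the real service calls.

def rescore_all(titles, score_page, refetch_summary, delay=0.0):
    """Score each page with the new model, then trigger a summary
    re-render; throttle with `delay` to keep backend load manageable."""
    done = []
    for title in titles:
        score_page(title)       # warm the ORES cache with the new model
        refetch_summary(title)  # RESTBase pulls in the fresh score
        done.append(title)
        if delay:
            time.sleep(delay)
    return done
```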

> We have a concept of content-type version. When the version is bumped, all old content is disregarded and re-rendered from scratch automatically. We could use that for ORES model updates - bump a patch version of the content type.

Triggering cache invalidation for tens of millions of pages simultaneously seems a bit scary. Has that been tested in practice?

> Massive Varnish purges are not supported currently, so the only option is waiting for it to fall out of the cache naturally. When XKey support is added I could envision adding an XKey: Summary to the response and then mass-purging all summaries with that key on model change.

Invalidating the Varnish cache for tens of millions of pages simultaneously is definitely scary. XKey probably can't handle that amount; manual bans could, but it would crush the backend.

In any case, this could be handled by a backend job, according to Aaron (go through all pages, score them with the new model, and as a page is done tell RESTBase to refetch the data). That sounds a lot easier than doing some sort of gradual invalidation on the RESTBase side.

Worst case, if we didn't add any invalidation logic anywhere, new scores would appear within 30 days due to normal cache expiry, which is still not tragic, so I suppose this is not really a concern.

> Triggering cache invalidation for tens of millions of pages simultaneously seems a bit scary. Has that been tested in practice?

To clarify, the versioning is actually set up to avoid the need to invalidate all caches at once. Clients are expected to migrate gradually, we vary on Accept header values, and we can implement upgrades as efficiently as possible (not necessarily a re-render from scratch).

For model updates it should be a lot more efficient to update just the ORES score. We can do that on demand, triggered by clients asking for a newer (minor) content version, or with a background job doing the same. Re-doing the scoring does not look like a breaking change to the summary endpoint overall, so it doesn't look like it would require a major version bump.

If I understand correctly, an updated scoring model shouldn't require even a minor semver bump on the RESTBase endpoint, at least as long as the ORES JSON key-value structure is the same, right? I might be misunderstanding the nature of what it is we're referring to when asking for a "newer (minor) content version", though. @GWicke, would you please clarify?

@JAllemandou FYI this is one of the use-cases of applying the wp10 "article quality model" to revision texts using hadoop/spark/etc.

To wrap up the annual planning aspect of this, does everyone agree that implementing this task is relatively straightforward and does not require any hardware purchase (the possible issues in T157222/T146664 notwithstanding) or major involvement from the ORES or Services teams?

> To wrap up the annual planning aspect of this, does everyone agree that implementing this task is relatively straightforward and does not require any hardware purchase (the possible issues in T157222/T146664 notwithstanding) or major involvement from the ORES or Services teams?

On the Reading side, no. It does, however, imply a higher volume of work for ORES so that should be mentioned/stated in the linked tickets.

> If I understand correctly, an updated scoring model shouldn't require even a minor semver bump on the RESTBase endpoint, at least as long as the ORES JSON key-value structure is the same, right?

It doesn't require a version bump, but depending on how drastic the scoring changes are we might still elect to bump minor versions even without changes to the result format schema. This gives us a way to force re-renders where this makes sense.

> To wrap up the annual planning aspect of this, does everyone agree that implementing this task is relatively straightforward and does not require any hardware purchase (the possible issues in T157222/T146664 notwithstanding) or major involvement from the ORES or Services teams?

The effort needed on the services side looks very manageable, below anything we need to call out for annual planning.

I don't think ORES will have any serious capacity concerns for this. We already generate scores for every edit as it is saved. Generating the "wp10" score on top of the "damaging" score requires trivial (effectively zero) additional resources.

> I don't think ORES will have any serious capacity concerns for this. We already generate scores for every edit as it is saved. Generating the "wp10" score on top of the "damaging" score requires trivial (effectively zero) additional resources.

Awesome. There's one more issue for us, though: ORES takes the rev ID as input, but upon a request to the RESTBase summary endpoint we only know the page title, not the latest revision ID. We can obtain the latest rev ID either from Cassandra or from the MW API, but that means we can't request ORES scores in parallel with fetching the other data, effectively doubling the endpoint latency. It's not a big deal, since this happens on background updates and normal usage will be served from storage, but it's still not ideal.
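The sequential dependency can be sketched as follows. The fetchers are injected so the sequencing logic is testable without network access; the lambdas below are stand-ins for the real MW API and ORES calls:

```python
# Sketch of the two-step lookup described above: resolve a page title
# to its latest revision ID (from storage or the MW API), then ask
# ORES for that revision's wp10 score.

def summary_wp10(title, get_latest_rev_id, get_wp10_score):
    """The two calls are inherently sequential: the ORES request cannot
    start until the revision ID is known, so the endpoint's latency is
    roughly the sum of both round trips."""
    rev_id = get_latest_rev_id(title)
    return rev_id, get_wp10_score(rev_id)

rev, score = summary_wp10(
    "Zebra",
    get_latest_rev_id=lambda t: 720618545,          # stand-in for the MW API
    get_wp10_score=lambda r: {"prediction": "GA"},  # stand-in for ORES
)
```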

@Halfak What do you think: can we add an alternative way to request the wp10 score from ORES by page title instead of by revision ID?

@Pchelolo, just to clarify: you want ORES to send a request to api.php, look up the most recent rev_id for a given page_title, and then generate a score for you? I'm not sure that this would be a good thing for ORES to support.

> @Pchelolo, just to clarify: you want ORES to send a request to api.php, look up the most recent rev_id for a given page_title, and then generate a score for you? I'm not sure that this would be a good thing for ORES to support.

I'm not sure I completely understand how ORES works, but my limited understanding is that for the wp10 model it fetches the whole content of the article and runs scoring on it, not on a diff between revisions, right? If so, fetching the content from the MW API could be done by revision ID (what's done currently) or by page title (what we want here).

Again, as I've stated before, this is a minor optimization for background updates of the summary endpoint in RESTBase, so if it's a major inconvenience on the ORES side we can totally live with the current APIs and be completely fine. But if I understand the wp10 score correctly, it's a measurement of an article, not of a particular revision, so even from an API design perspective, being able to access it by article title (or page ID) seems nice and appropriate.

wp10 scores revisions; damaging etc. score diffs between subsequent revisions. (Conceptually, anyway. Technically, they still assign the score to a revision; every revision has a single parent, so there is a one-to-one relationship between revisions and diffs, absent some edge cases.) When people talk about a page/article/title, they almost always actually mean a revision (it rarely makes sense otherwise, as you could not make any assumptions about the content of the page).

ORES could assume you are talking about the latest revision and return the score for that, but that complicates the logic (you would have to verify you are getting the revision you expected, and re-query ORES if not, to avoid having inconsistent data in the summary; plus ORES would have to deal with cache invalidation, while right now scores can only be invalidated by ORES maintenance actions), and it seems like a premature micro-optimization IMO.
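The verify-and-retry logic described above can be sketched as follows. The fetchers are injected stand-ins, not real APIs; `max_retries` is an illustrative safety bound:

```python
# Sketch of the consistency check: if ORES resolved the "latest
# revision" itself, the caller would need to verify the scored revision
# matches the one it expected, and re-query on mismatch, to avoid
# leaving inconsistent data in the summary.

def consistent_score(title, get_latest_rev_id, score_latest, max_retries=3):
    """Retry until the scored revision matches the latest revision,
    so a racing edit cannot leave stale data in the summary."""
    for _ in range(max_retries):
        expected = get_latest_rev_id(title)
        rev_id, score = score_latest(title)
        if rev_id == expected:  # no edit landed between the two calls
            return rev_id, score
    raise RuntimeError("revision kept changing while scoring: " + title)
```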

Thanks for the context, @Tgr. It sounds convincing that it's better to do it in two steps in RESTBase instead of mangling the ORES API.

@Pchelolo, can we pass the revision in ChangeProp requests?

GWicke triaged this task as Medium priority. Aug 8 2017, 9:08 PM
awight renamed this task from "Add ORES WP10 data to summaries?" to "Add ORES articlequality data to summaries?". Sep 26 2018, 6:47 PM
awight updated the task description.
elukey subscribed.

We are moving to Lift Wing: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing

I am closing old tasks related to ORES since it is being deprecated; please re-open if you feel that any work could be done on Lift Wing.