
RfC: OpenGraph descriptions in wiki pages
Open, Needs Triage, Public

Description

OpenGraph has become a universal standard for providing information suitable for showing a small preview of a webpage, used by a variety of social media, chat software, and search engines. With PageImages there is decent support for og:image; no similar support for og:description exists. The lack of support for decent page descriptions is a painful deficiency in MediaWiki, affecting both our internal tools (e.g. T185017) and our ability to share our content with the world (e.g. T142090). There is interest in WMF Readers in changing that.
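For reference, the markup in question is a couple of meta tags in the page head; the values below are purely illustrative:

```
<meta property="og:image" content="https://example.org/images/Example.jpg">
<meta property="og:description" content="A short plain-text summary of the page.">
```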

Some extensions exist for this purpose (e.g. OpenGraphMeta) but they all assume the description is provided manually, which doesn't scale. The TextExtracts extension can provide automatic page descriptions, and it would be straightforward to display those, but the quality is not great: the WMF doesn't really maintain the extension anymore and has opted instead for a summary service (part of the Page Content Service), which has significantly more complicated logic for dealing with templates, parentheses full of low-relevance data in the first sentence, and similar intricacies of Wikipedia articles (non-Wikipedias are not supported). The difference is substantial enough that only the logic used in PCS is acceptable for OpenGraph summaries.

This leaves two potential approaches: port the relevant part of PCS to PHP, or figure out how to include output from an external service in the HTML of a page. Neither of those options looks great. The goal of the RfC is to determine which one is acceptable / preferable, and whether better alternatives exist. Feedback from third-party wikis on how they would generate the text would also be valuable (most people probably don't want to rely on PCS/RESTBase, and it's fairly Wikipedia-specific anyway; what would be the best level at which to abstract it away? E.g. should we fall back to TextExtracts?)

Porting the summary logic in the Page Content Service to MediaWiki

The code that would have to be ported is fairly small, but right now it is not time-sensitive, uses DOM traversal liberally, and takes a long time for very large pages, whereas as part of parsing/rendering it would have to finish quickly even for a big article. Also, the input would have to be the HTML rendered by the PHP parser instead of Parsoid, which might cause problems. So this would probably be a major rewrite effort, and we'd end up doing the same thing in entirely different ways in MediaWiki and PCS, with a double maintenance burden for description-generation code.

See T214000: Evaluate difficulty of porting PCS summary logic to PHP for details.

Using Page Content Service data in MediaWiki page HTML

Page Content Service uses Parsoid HTML (and to a lesser extent, MediaWiki APIs) as input; Parsoid uses the wikitext provided by the MediaWiki API. So when an edit happens, it needs to be published in MediaWiki, then processed by Parsoid, then processed by PCS. That's too slow for MediaWiki HTML rendering, which is typically invoked immediately after an edit (since the editing user gets redirected back to the page). So a naive approach of just querying PCS from MediaWiki when a page is rendered wouldn't work.

On the other hand, the description is used by the sharing functionality of social media sites, which is triggered on demand, and perhaps to a small extent by web crawlers, which might be triggered by an edit but probably not within seconds. So if the description is wrong or missing for a short time, that should not be a big deal. That means we can use the following strategy when rendering a page (a rough sketch follows the list):

  • Look up the description in some local cache (see below).
  • If that failed, query PCS.
    • If it gives a fast response and the revision ID matches, use the description it returned, and cache it.
    • If it gives a 404, or takes too much time, or the response has an older revision ID, use some kind of fallback (the outdated description returned by the service, or TextExtracts, or simply the empty string), and schedule an update job.
    • If the response has a newer revision ID, use a fallback or set the description to empty. We are in an old revision, the description probably won't matter for any real-world use case. (FlaggedRevs and revision deletion might complicate this, though. See T163462 and T203835.)
  • The update job ensures some small delay (maybe reschedules itself a few times if needed, although hopefully there's a cleaner way), then fetches the PCS description, caches it and purges the page from Varnish.
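A minimal sketch of that lookup logic, for illustration only: the cache array and the scheduleSummaryUpdateJob() helper are hypothetical placeholders rather than real MediaWiki APIs, and error handling is omitted. The summary endpoint URL and its extract / revision fields are the existing public REST API ones.

```
<?php
// Illustrative sketch of the strategy above. Not production code: the cache
// and the job-scheduling helper are stand-ins for whatever MediaWiki would
// actually use (e.g. the object cache and the job queue).

function getOpenGraphDescription( string $title, int $revId, array &$cache ): string {
    // 1. Look up the description in a local cache.
    if ( isset( $cache[$title] ) && $cache[$title]['revid'] === $revId ) {
        return $cache[$title]['extract'];
    }

    // 2. Cache miss or stale entry: query the PCS summary endpoint with a
    //    short timeout so page rendering is never blocked for long.
    $url = 'https://en.wikipedia.org/api/rest_v1/page/summary/' . rawurlencode( $title );
    $ctx = stream_context_create( [ 'http' => [ 'timeout' => 1 ] ] );
    $body = @file_get_contents( $url, false, $ctx );
    $summary = $body !== false ? json_decode( $body, true ) : null;
    $serviceRev = is_array( $summary ) ? (int)( $summary['revision'] ?? 0 ) : 0;

    if ( $serviceRev === $revId ) {
        // Fast response for the right revision: use it and cache it.
        $cache[$title] = [ 'revid' => $revId, 'extract' => $summary['extract'] ?? '' ];
        return $cache[$title]['extract'];
    }

    if ( $serviceRev > $revId ) {
        // We are rendering an old revision; skip the description entirely.
        return '';
    }

    // 404, timeout, or an older revision: fall back (outdated extract,
    // TextExtracts, or the empty string) and schedule a delayed update job
    // that will re-fetch the summary, cache it, and purge Varnish.
    scheduleSummaryUpdateJob( $title, $revId );
    return is_array( $summary ) ? ( $summary['extract'] ?? '' ) : '';
}

// Hypothetical placeholder for pushing a delayed refresh job to the job queue.
function scheduleSummaryUpdateJob( string $title, int $revId ): void {
    // In MediaWiki this would enqueue a job that waits a bit, fetches the PCS
    // summary for $revId, stores it, and purges the page from the edge cache.
}
```

The one-second timeout above is arbitrary; the actual value would be a trade-off between render latency and how often the fallback/job path is taken.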

This would make page generation non-deterministic and hard to debug or reason about, and create a cyclic dependency between MediaWiki and PCS (as PCS relies on Parsoid, which relies on MediaWiki APIs).

Other options considered

  • Add a "functional" mode to the PCS summary endpoint, where it takes all data (or at least the wikitext) from the request, uses that data to get the Parsoid HTML (Parsoid already has a "functional" endpoint, or at least an approximation that's close enough for our purposes), and processes that to get the description (a sketch of what such a request might look like follows). This is too slow to be acceptable during page rendering (p99 latency of PCS is tens of seconds), although it might be used together with the job queue option to remove the cyclic dependency, if that's a major concern.
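Purely to illustrate the shape of such a call, here is a hedged sketch of what MediaWiki might send to a hypothetical functional endpoint; the URL and the payload fields below do not exist today.

```
<?php
// Hypothetical request to a "functional" PCS summary endpoint. No such
// endpoint exists; the URL and payload fields are made up for illustration.
$payload = json_encode( [
    'title'    => 'Example',
    'wikitext' => "'''Example''' is a page used for illustration.",
] );
$ctx = stream_context_create( [ 'http' => [
    'method'  => 'POST',
    'header'  => "Content-Type: application/json\r\n",
    'content' => $payload,
    'timeout' => 5,
] ] );
$response = @file_get_contents( 'https://pcs.example.org/summary/functional', false, $ctx );
$summary = $response !== false ? json_decode( $response, true ) : null;
```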

Event Timeline

Tgr created this task. Jan 11 2019, 12:30 AM
Restricted Application added a subscriber: Aklapper. Jan 11 2019, 12:30 AM
Tgr updated the task description. Jan 22 2019, 9:32 AM
Tgr updated the task description. Jan 24 2019, 11:30 PM
Tgr renamed this task from [DRAFT] RfC: OpenGraph descriptions in wiki pages to RfC: OpenGraph descriptions in wiki pages. Jan 24 2019, 11:34 PM
Tgr updated the task description. Jan 24 2019, 11:40 PM

mobrovac added a comment.

Another alternative could be to inject the tags at the edge, before the HTML is cached. The rough workflow could be something like:

  • a page is edited
  • the parser reparses it
  • Parsoid/PCS reconstruct their parts
  • when a page is requested, if the HTML is present at the edge cache, serve it; otherwise ask for the HTML and summary and inject the needed tags.

In this way, we keep the current summary API while integrating it into the final HTML delivered to clients.

Tgr added a comment. Feb 5 2019, 9:41 PM

@mobrovac: my impression was that Ops is against edge composition in general. Also, this would solve the language issue and the cyclic dependency but not get rid of the speed issue. We'd either have to make a "fast" version of the summary API that has some guaranteed response speed and returns a placeholder if needed, or have an aggressive timeout in the edge logic and apply a placeholder there (which probably means an empty placeholder - Varnish can't do something like "use the description for the previous revision if the current one is not available yet"; I imagine within PCS it would be doable).

Compared to the job queue approach, it's not clear how we'd guarantee eventual correctness of the cached HTML (important for vandalism fixes). Maybe changeprop could just purge the edge cache when the summary endpoint is done updating?

mobrovac moved this task from Inbox to Under discussion on the TechCom-RFC board. Feb 6 2019, 9:35 PM
Joe added a subscriber: Joe. Feb 7 2019, 9:47 AM
Joe added a comment (edited). Feb 8 2019, 6:58 AM

> @mobrovac: my impression was that Ops is against edge composition in general. Also, this would solve the language issue and the cyclic dependency but not get rid of the speed issue. We'd either have to make a "fast" version of the summary API that has some guaranteed response speed and returns a placeholder if needed, or have an aggressive timeout in the edge logic and apply a placeholder there (which probably means an empty placeholder - Varnish can't do something like "use the description for the previous revision if the current one is not available yet"; I imagine within PCS it would be doable).
> Compared to the job queue approach, it's not clear how we'd guarantee eventual correctness of the cached HTML (important for vandalism fixes). Maybe changeprop could just purge the edge cache when the summary endpoint is done updating?

We're not against the use of edge composition per se, but:

  • We're definitely against moving any remotely complex logic to it
  • I don't think it solves this specific problem
  • Implementation of any form of edge composition is dependent on the transition to ATS which is going to take a few more quarters.

As it stands, it's pretty clear to me that right now we can't build a healthy architecture for this feature. Probably @Tgr's proposal of opportunistically querying PCS from MediaWiki is the best way to go, with a small caveat: we already do update content in MCS/PCS via changeprop for every edit AIUI.

Once the parsoid parser has been moved within MediaWiki, I can see PCS becoming a purely functional transformation service.

A few more caveats:

  • Dependency cycles should be avoided as much as possible (mw calls PCS, which calls mw via a couple of chains of calls)
  • Do not use Restbase to mediate calls between services - services should have their own cache, not relying on an ill-advised pattern we've used for too long (putting the cache of a service outside of said service's control).

One last comment: I don't think porting PCS to PHP makes sense unless either of the following conditions is met:

  • we also have parsoid in PHP
  • we can avoid using parsoid-generated HTML
Tgr added a comment. Feb 8 2019, 9:33 PM

> we already do update content in MCS/PCS via changeprop for every edit AIUI.

Yeah. That's not directly useful here as it is slow (PCS updates can take a minute for outliers; the editor needs to see the HTML almost immediately), other than that we should not needlessly duplicate it. I'm assuming request coalescing in Varnish would take care of that.

> • Dependency cycles should be avoided as much as possible (mw calls PCS, which calls mw via a couple of chains of calls)

Not sure how that's possible here, other than using a large enough delay and hoping that PCS has already processed and cached the summary by then. But then, MediaWiki page save -> changeprop -> PCS -> Parsoid -> MediaWiki API is already a cycle, so this doesn't seem to introduce anything new.

> • Do not use Restbase to mediate calls between services - services should have their own cache, not relying on an ill-advised pattern we've used for too long (putting the cache of a service outside of said service's control).

That seems like an orthogonal issue. Right now everything else uses RESTBase and services do not have their own cache - that does not seem to be any more problematic here than in general, so I think it's fine for this functionality to use RESTBase as well. Once caching is moved into PCS, it will be trivial to switch to using that.

> One last comment: I don't think porting PCS to PHP makes sense unless either of the following conditions is met:
>
> • we also have parsoid in PHP
> • we can avoid using parsoid-generated HTML

Yeah. Using the old parser and rewriting a bunch of the processing logic to account for that (and using some one-off extension to mark up the output of certain templates, since Parsoid does that and the old parser doesn't) would be the desperation option. But I'd rather avoid it, use the service, and probably port to PHP eventually once Parsoid has settled.

Yann added a subscriber: Yann. Mar 8 2019, 9:23 PM

Quick question: what's the plan for this from a product point of view? Is it slated for implementation soon? Knowing this will help us plan how/when to discuss.

@JKatzWMF @JMinor would you support us addressing T142090: Add hover-card like summary (og:description) to open graph meta data by printing a plain summary as time permits, here in Q4 or perhaps as a Q1 FY 19-20 project, to bridge the gap between now and the bigger social sharing work?

@Tgr what do you think the level of effort is for a first version of this?

daniel added a subscriber: daniel. Apr 11 2019, 9:26 AM

If I understand correctly, the complexity of using OpenGraph arises from the need to embed the extracted description directly in the HTML output of the rendered page. Is there something in OpenGraph that would allow us to add a level of indirection? Something like <link rel="opengraph" ...> or something, that we could use to point to the OpenGraph description, instead of embedding it?

If this is not possible, perhaps OpenGraph could use the same approach as T212189: New Service Request: Wikidata Termbox SSR: trigger generation upon save, but only pull in the generated content (from an external service) when a full HTML page is served from index.php, and inject the headers on the fly. If the extracted description is not ready yet, the page could be served without it (but marked uncacheable), or the output could wait.

Another note: the description says that porting the PCS logic to PHP is complicated by the fact that then, it would have to operate on top of the built-in MediaWiki parser. That problem should vanish with Parsoid-PHP.

Tgr added a comment. Apr 12 2019, 3:46 AM

> Quick question: what's the plan for this from a product point of view? Is it slated for implementation soon? Knowing this will help us plan how/when to discuss.

Depends on how much effort it turns out to be. I was hoping to work on it in Q4, but I have picked up other commitments since writing the RfC, so that seems unlikely now; I hope to find time for it in Q1.

> @Tgr what do you think the level of effort is for a first version of this?

Assuming we go with the job queue version, probably a week or less? That does not include FlaggedRevs; I'm not quite sure what to do about that, unless RESTBase starts supporting it in the meantime.

> Is there something in OpenGraph that would allow us to add a level of indirection? Something like <link rel="opengraph" ...> or something, that we could use to point to the OpenGraph description, instead of embedding it?

I spent some time looking for that when I wrote the RfC but didn't find anything.

> If this is not possible, perhaps OpenGraph could use the same approach as T212189: New Service Request: Wikidata Termbox SSR: trigger generation upon save, but only pull in the generated content (from an external service) when a full HTML page is served from index.php, and inject the headers on the fly. If the extracted description is not ready yet, the page could be served without it (but marked uncacheable), or the output could wait.

That would mean that between the edit and the extract becoming available all requests would hit MediaWiki and would have to wait for MediaWiki to hit RESTBase. Not great.

> Another note: the description says that porting the PCS logic to PHP is complicated by the fact that then, it would have to operate on top of the built-in MediaWiki parser. That problem should vanish with Parsoid-PHP.

Which is estimated to take a year and a half; ideally we'd want this sooner.

Jheald added a subscriber: Jheald. Jun 6 2019, 10:14 AM

With the latest release of the Page Content Service, I'm finding myself wishing I could use that rather than the PHP MobileFormatter class and all its complicated (and buggy) code in MobileFrontend. The Page Content Service seems to do everything that does and more (and has a whole team maintaining it). If the solution here is to port the summary HTML service to PHP, it means that at some point, when faced with the same problem, we'll likely need to port the even larger Page Content Service to PHP. These are complicated services, non-trivial to port, and I worry that if this becomes the de-facto solution here, it will significantly slow down any work to simplify the mobile stack. Please consider this when making a decision about the summary endpoint - other endpoints are likely to follow.

Tgr added a comment. Thu, Jul 11, 3:07 PM

The issue here is that the cached HTML would have to be updated at some later point, when the summary became available. At a glance, the proposed change to ParserCache does not make that easier (or harder).

> The issue here is that the cached HTML would have to be updated at some later point, when the summary became available. At a glance, the proposed change to ParserCache does not make that easier (or harder).

The idea is to allow HTML associated with a page to be updated at any later time. The idea is further that we have one service for managing this kind of cache, instead of creating a new one for every kind of data we are deriving from page content.