
Link preview sometimes shows wikitext markup and "Edit"
Closed, Resolved · Public

Description

  1. Visit the article [[User:Dmitry Brant/sandbox]]
  2. Click on the link to [[Laurence Hyde (artist)]]

Expected:

The link preview shows an extract from the article with no wikitext in the contents of the extract.

Actual:

The extract text contains wikitext.

Event Timeline

Spage created this task. · Jun 2 2015, 7:35 PM
Spage raised the priority of this task from to Needs Triage.
Spage updated the task description.
Restricted Application added a subscriber: Aklapper. · Jun 2 2015, 7:35 PM
Spage updated the task description. · Jun 15 2015, 4:58 AM
Spage set Security to None.

I copied the page and the TextExtracts API request for my copy doesn't include the word "Edit". I don't know why the enwiki TextExtracts returns "...== Etymology and usageEdit ==...", weird.

Jdlrobson triaged this task as Normal priority. · Sep 16 2015, 6:51 PM
Jdlrobson added a subscriber: Jdlrobson.

Is this still an issue? If so, can someone add an API request and the wikitext for the lead section of an article that has the problem? That will make this task more actionable.

Dbrant added a subscriber: Dbrant. · Sep 16 2015, 7:12 PM

@Jdlrobson The issue is still present, as described by @Spage in the description (which also mentions a sample API request).

It looks like this issue applies to articles where the lead section is smaller than the requested number of characters. For example:

  • The entirety of the lead section for [[Crepuscular]] is literally the single sentence, "Crepuscular animals are those that are active primarily during twilight (i.e., dawn and dusk)."
  • If I issue a query with exsentences=2, or with exsentences=10, it still (correctly?) returns just the single sentence.
  • It's only when I query with exchars that TextExtracts seems to go beyond the boundary of the lead section, and returns extra characters.
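The two kinds of request compared above can be written out as a minimal sketch. It only builds the query URLs and does not call the API. The endpoint is assumed to be the standard English Wikipedia Action API URL, the helper name `extract_query_url` is hypothetical, and the `exchars` value is an arbitrary illustration rather than a number taken from this task.

```python
from urllib.parse import urlencode

# Assumed endpoint: the standard English Wikipedia Action API URL.
API_URL = "https://en.wikipedia.org/w/api.php"


def extract_query_url(title, **limits):
    """Build a TextExtracts query URL for `title` with the given limits.

    Hypothetical helper for illustration; only constructs the URL.
    """
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "titles": title,
    }
    params.update(limits)
    return API_URL + "?" + urlencode(params)


# Sentence-limited request: reported above to return only the lead sentence.
by_sentences = extract_query_url("Crepuscular", exsentences=2)

# Character-limited request: reported above to run past the lead section.
# 525 is an arbitrary illustrative value, not taken from the task.
by_chars = extract_query_url("Crepuscular", exchars=525)

print(by_sentences)
print(by_chars)
```

Fetching both URLs for an article with a short lead section should make the difference between the two limits easy to compare side by side.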

Change 242141 had a related patch set uploaded (by Dbrant):
Prevent TextExtracts from possibly returning wikitext (workaround).

https://gerrit.wikimedia.org/r/242141

Change 242141 merged by jenkins-bot:
Prevent TextExtracts from possibly returning wikitext (workaround).

https://gerrit.wikimedia.org/r/242141

The patch only fixes the app. The problem is still present in TextExtracts.

dr0ptp4kt lowered the priority of this task from Normal to Low. · Aug 4 2016, 3:49 PM
dr0ptp4kt moved this task from Incoming to Triaged but Future on the Readers-Web-Backlog board.
Dbrant updated the task description. · Feb 24 2017, 12:07 AM
Mhurd added a subscriber: Mhurd. · Mar 14 2017, 9:19 PM
Mhurd added a comment. · Mar 14 2017, 9:24 PM

One friendly vote for bumping priority up a bit on this one :)

Screenshot of a test page I was playing with which exhibits the extract bug:

Seconded. This still happens pretty frequently in our link previews for articles whose lead section is sufficiently small.

bearND added a project: Services. · Edited · Mar 14 2017, 9:52 PM
bearND added subscribers: Mholloway, bearND.

Should we just add the exintro=true parameter to the RESTBase /page/summary endpoint and to the apps (maybe web, too)?
Or could we make exintro=true the default? I guess changing the default behavior after the fact could be problematic.
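As a rough sketch of what the proposal means for a caller, the change is a single extra parameter on an existing TextExtracts request. The helper name `with_intro_only` and the sample parameter values are hypothetical; this is not the RESTBase or app code.

```python
def with_intro_only(params):
    """Return a copy of `params` that limits the extract to the lead section.

    Hypothetical helper for illustration only.
    """
    updated = dict(params)
    # MediaWiki treats a boolean parameter as set whenever it is present.
    updated["exintro"] = "true"
    return updated


# Illustrative request parameters; the exchars value is arbitrary.
base = {
    "action": "query",
    "prop": "extracts",
    "exchars": 525,
    "titles": "Crepuscular",
}
fixed = with_intro_only(base)
print(fixed)
```

Copying the dictionary rather than mutating it keeps the original request available for comparing the two responses.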

Thanks to @Mholloway for finding this. Adding the Services team since the /page/summary endpoint is implemented directly in RESTBase.

Here's the example query:

Should we just add the exintro=true parameter to the RESTBase /page/summary endpoint and to the apps (maybe web, too)?

Seems like a good idea. I can easily make a PR for RESTBase. Do we want to force-regenerate all summaries to get rid of this, or is it fine to rely on natural regeneration as the pages get edited?

Pchelolo edited projects, added Services (next); removed Services. · Mar 14 2017, 10:00 PM

The only cases I'm seeing are ones where "==" shows up. Wouldn't it be better to simply truncate the extract at the first indexOf("=="), dropping the "==" and everything after it?

Are there other types of wikitext that show up?
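The substring idea floated here could look roughly like the sketch below. It is an illustration of the suggestion only, not the workaround merged in change 242141, and the function name and sample string are hypothetical (the sample is modeled on the "== Etymology and usageEdit ==" fragment quoted earlier in this task).

```python
def truncate_at_first_heading(extract):
    """Drop the first "==" and everything after it, then trim trailing whitespace."""
    index = extract.find("==")
    if index == -1:
        # No heading markup found: return the extract unchanged.
        return extract
    return extract[:index].rstrip()


# Hypothetical extract that runs past the lead section.
sample = (
    "Crepuscular animals are active during twilight.\n"
    "== Etymology and usageEdit ==\n"
    "The word..."
)
cleaned = truncate_at_first_heading(sample)
print(cleaned)
```

Note that this only handles heading markup; any other kind of wikitext in the extract would pass through untouched.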

Oh wait, ignore that; that's what exintro does...
So there's no bug in TextExtracts here, is there? It's just how it's being used? Should we remove the project?

We've fixed it in RESTBase by https://github.com/wikimedia/restbase/pull/771 (not deployed yet).

Since it's affecting hovercards now, we can bump the summary version to ignore all the content stored in Cassandra and regenerate the content on demand, instead of waiting for each of the affected articles to get edited. @Jdlrobson, is active regeneration needed on your side?

That doesn't purge the Varnish cached content though, so it will stick around for some time until it falls out of the cache.
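The version bump described above can be pictured with a small sketch: stored entries carry the version they were generated with, and a mismatch forces regeneration on the next read. Everything here (`SUMMARY_VERSION`, `SummaryStore`, the in-memory dictionary standing in for storage) is hypothetical and is not how RESTBase or Cassandra is actually structured. It also does not model the separate Varnish layer, which is why stale responses can persist for a while.

```python
SUMMARY_VERSION = 2  # hypothetical: bumped from 1 when the extract logic changed


class SummaryStore:
    """Toy store that ignores entries written under an older version."""

    def __init__(self, generate):
        self._generate = generate  # callable: title -> fresh summary
        self._stored = {}          # title -> (version, summary)

    def get(self, title):
        entry = self._stored.get(title)
        if entry is not None and entry[0] == SUMMARY_VERSION:
            return entry[1]
        # Missing or outdated entry: regenerate on demand and overwrite it.
        summary = self._generate(title)
        self._stored[title] = (SUMMARY_VERSION, summary)
        return summary


store = SummaryStore(lambda title: "fresh summary of " + title)
# Simulate an entry written before the bump, still containing wikitext.
store._stored["Crepuscular"] = (1, "Old text == Etymology and usageEdit ==")
print(store.get("Crepuscular"))
```

The appeal of this approach is that nothing has to be deleted up front; outdated entries are simply replaced the first time someone asks for them.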

Spoke with @Pchelolo yesterday and he said that the fix can be SWATed today and the existing cache purged.

@Jdlrobson Yes, I will deploy the fix later today. But we can't mass-purge Varnish, so we will have to wait a bit until the wrong content falls out of Varnish.

mobrovac closed this task as Resolved. · Mar 15 2017, 7:04 PM
mobrovac assigned this task to Pchelolo.
mobrovac edited projects, added Services (done); removed Services (next).
mobrovac added a subscriber: mobrovac.

The fix has been deployed. The changes will become visible as soon as the summary outputs fall out of cache.

Mhurd added a comment. · Mar 15 2017, 9:57 PM

Fix confirmed! :)

Contrast with "before" screenshot: https://phabricator.wikimedia.org/T101153#3099947

Nice work!