Content intended to be hidden appears in text extract
Closed, ResolvedPublic

Description

Steps to reproduce

  1. Go to the Bethe formula article.
  2. Tap on the speed of light link under the formula section.

Expected results

A summary for the speed of light article is shown.

Actual results

A summary is shown but it includes a black spade character.

screenshot-2016-02-09-06-42-51-210452036.png (2×1 px, 353 KB)

Response

https://en.wikipedia.org/api/rest_v1/page/summary/Speed_of_light
{
  "title": "Speed of light",
  "extract": "The speed of light in vacuum, commonly denoted c, is a universal physical constant important in many areas of physics. Its precise value is 7008299792458000000♠299792458 metres per second (approximately 7008300000000000000♠3.00×108 m/s), since the length of the metre is defined from this constant and the international standard for time. According to special relativity, c is the maximum speed at which all matter and information in the universe can travel. It is the speed at which all massless particles and changes of the associated fields (including electromagnetic radiation such as light and gravitational waves) travel in vacuum. Such particles and waves travel at c regardless of the motion of the source or the inertial reference frame of the observer.",
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Earth_to_Sun_-_en.png/320px-Earth_to_Sun_-_en.png",
    "width": 320,
    "height": 181
  },
  "lang": "en",
  "dir": "ltr"
}
https://en.wikipedia.org/w/api.php?action=query&format=json&formatversion=2&prop=extracts%7Cpageimages&redirects=true&exsentences=5&explaintext=true&piprop=thumbnail%7Cname&pithumbsize=320&titles=Speed_of_light
{
  "batchcomplete": true,
  "query": {
    "normalized": [
      {
        "from": "Speed_of_light",
        "to": "Speed of light"
      }
    ],
    "pages": [
      {
        "pageid": 28736,
        "ns": 0,
        "title": "Speed of light",
        "extract": "The speed of light in vacuum, commonly denoted c, is a universal physical constant important in many areas of physics. Its precise value is 7008299792458000000♠299792458 metres per second (approximately 7008300000000000000♠3.00×108 m/s), since the length of the metre is defined from this constant and the international standard for time. According to special relativity, c is the maximum speed at which all matter and information in the universe can travel. It is the speed at which all massless particles and changes of the associated fields (including electromagnetic radiation such as light and gravitational waves) travel in vacuum. Such particles and waves travel at c regardless of the motion of the source or the inertial reference frame of the observer.",
        "thumbnail": {
          "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Earth_to_Sun_-_en.png/320px-Earth_to_Sun_-_en.png",
          "width": 320,
          "height": 181
        },
        "pageimage": "Earth_to_Sun_-_en.png"
      }
    ]
  }
}

Environments observed

Service version: deploy/2016-02-02/68e38ec
App version: 00a1c69
Android OS versions: API 23
Device model: Nexus 6P
Device language: English

Event Timeline

Niedzielski raised the priority of this task from to Medium.
Niedzielski updated the task description. (Show Details)
Niedzielski subscribed.
bearND subscribed.

As @Niedzielski mentioned in the edit, the same happens with the api.php endpoint. The issue must be upstream in the TextExtracts functionality.

This is ultimately caused by the Val template inserting sort keys into the actual article text, wrapped in display:none spans that are then flattened to plain text when TextExtracts is called with explaintext=true.

Its precise value is <b><span class="nowrap"><span style="display:none" class="sortkey">7008299792458000000♠</span>299<span style="margin-left:.25em;">792</span><span style="margin-left:.25em;">458</span>&#160;<a href="/wiki/Metres_per_second" class="mw-redirect" title="Metres per second">metres per second</a></span></b> (approximately <span class="nowrap"><span style="display:none" class="sortkey">7008300000000000000♠</span>3.00<span style="margin-left:0.25em;margin-right:0.15em;">×</span>10<sup>8</sup>&#160;m/s</span>), since the length of the metre is defined from this constant and the <a href="/wiki/Second#International_second" title="Second">international standard for time</a>.
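
The flattening described above can be reproduced with a minimal sketch. This is not the TextExtracts code (the extension is written in PHP); it is a Python illustration using the standard library's `html.parser`, and the names `NaiveTextExtractor` and `naive_plain_text` are hypothetical. It shows that a converter which keeps every text node and ignores inline styles will emit the hidden sort key.

```python
from html.parser import HTMLParser


class NaiveTextExtractor(HTMLParser):
    """Collects every text node; tags and inline styles are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # display:none is only a style hint, so hidden text is kept too.
        self.parts.append(data)


def naive_plain_text(html):
    """Hypothetical helper: flatten an HTML fragment to plain text."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Shortened version of the markup quoted in this task.
snippet = (
    '<span class="nowrap">'
    '<span style="display:none" class="sortkey">7008299792458000000♠</span>'
    '299<span style="margin-left:.25em;">792</span>'
    '<span style="margin-left:.25em;">458</span></span>'
)

print(naive_plain_text(snippet))  # 7008299792458000000♠299792458
```

The printed string matches the corrupted value in the extracts quoted in the task description.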
Jdlrobson subscribed.

Looks like something that will need to be fixed locally (on wiki).

Mholloway renamed this task from [Bug] Article summary shows encoding issue to [Bug] Content intended to be hidden appears in text extract. Sep 29 2016, 7:59 PM
Niedzielski renamed this task from [Bug] Content intended to be hidden appears in text extract to Content intended to be hidden appears in text extract. Nov 9 2016, 7:35 PM
TheDJ subscribed.

Shouldn't we add the sortkey class to the list of elements that need to be stripped from extracts (wgExtractsRemoveClasses)?

We do the same for coordinates for instance.
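
A rough sketch of what class-based stripping could look like, written in Python purely for illustration. The real extension is PHP and its implementation may differ; `ClassStrippingExtractor`, `strip_classes`, and the plain class-name list are assumptions made for this example. The idea is to drop every element carrying a listed class, together with all of its descendants, before the text is flattened.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not change nesting depth.
VOID_TAGS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class ClassStrippingExtractor(HTMLParser):
    """Flattens HTML to text, skipping elements with a listed class."""

    def __init__(self, remove_classes):
        super().__init__(convert_charrefs=True)
        self.remove_classes = set(remove_classes)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside a removed one
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.remove_classes.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_classes(html, remove_classes):
    """Hypothetical helper: plain text without the listed classes."""
    parser = ClassStrippingExtractor(remove_classes)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


html = (
    '<span class="nowrap">'
    '<span style="display:none" class="sortkey">7008299792458000000♠</span>'
    '299<span style="margin-left:.25em;">792</span>'
    '<span style="margin-left:.25em;">458</span></span>'
)
print(strip_classes(html, ["sortkey"]))  # 299792458
```

Matching on the class rather than on `display:none` keeps the rule explicit and configurable, which is the approach suggested in the comment above.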

Change 344742 had a related patch set uploaded (by Kaldari):
[mediawiki/extensions/TextExtracts@master] Adding sortkey class to ExtractsRemoveClasses

https://gerrit.wikimedia.org/r/344742

Change 344742 merged by jenkins-bot:
[mediawiki/extensions/TextExtracts@master] Adding sortkey class to ExtractsRemoveClasses

https://gerrit.wikimedia.org/r/344742

phuedx subscribed.

At the very least, this can be verified tomorrow (Thursday, 30th) after the MediaWiki train rolls on by.

This appears to still be an issue on prod for both endpoint examples given. :/

> This appears to still be an issue on prod for both endpoint examples given. :/

This appears to be a caching issue. I purged the cache on Speed of light and now the action API query no longer returns the extraneous data. This will slowly roll out as pages are edited or otherwise have their cache reset.
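
For sign-off, a check along these lines could be run against a saved action API response. It is only a sketch: `leaked_extracts` is a hypothetical helper, and it assumes the `formatversion=2` response shape shown in the task description, where `query.pages` is a list.

```python
import json

SORTKEY_MARKER = "\u2660"  # the black spade character that ends each sort key


def leaked_extracts(response_text):
    """Return the titles of pages whose extract still contains the marker."""
    data = json.loads(response_text)
    pages = data.get("query", {}).get("pages", [])
    return [
        page["title"]
        for page in pages
        if SORTKEY_MARKER in page.get("extract", "")
    ]


# Trimmed stand-ins for a response before and after the cache purge.
before = json.dumps({"query": {"pages": [{
    "title": "Speed of light",
    "extract": "Its precise value is 7008299792458000000\u2660299792458 metres per second",
}]}})
after = json.dumps({"query": {"pages": [{
    "title": "Speed of light",
    "extract": "Its precise value is 299792458 metres per second",
}]}})

print(leaked_extracts(before))  # ['Speed of light']
print(leaked_extracts(after))   # []
```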

> This appears to still be an issue on prod for both endpoint examples given. :/

Sorry. I {could,should}'ve explained this better. Thanks for clarifying the situation and providing steps for testing/sign off @Deskana!


I reacquainted myself with the extension so that I could add a little more detail if necessary. However, I discovered that:

  • Extracts are stored in memcache indefinitely until the associated page is touched; and
  • The cache keys that the extension uses don't vary with the ExtractsRemoveClasses config variable (or some notion of a version of the codebase).

Changes to the ExtractsRemoveClasses config variable won't be reflected until the page is touched or its cache entry is evicted due to memory pressure. I'm not sure how often eviction occurs, but it isn't limited to keys generated by TextExtracts, and keys with finite expiries are evicted first. Simply put, I can't be sure that extracts for long-tail pages will reflect the change.

Matching the parser cache's TTL doesn't seem unreasonable.
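
The two points above (keys that don't vary with the config, and entries that never expire) can be illustrated with a small Python sketch. Everything here is hypothetical: `extract_cache_key`, `ExpiringCache`, the key format, and the in-memory dictionary standing in for memcached are assumptions for illustration, not the extension's code.

```python
import hashlib
import json
import time


def extract_cache_key(page_id, remove_classes):
    """Build a key that changes whenever the removal list changes."""
    canonical = json.dumps(sorted(remove_classes))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"textextracts:{page_id}:{digest}"


class ExpiringCache:
    """Tiny in-memory stand-in for memcached with a per-entry TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self.entries[key] = (self.clock() + self.ttl_seconds, value)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self.entries[key]  # stale entry: caller regenerates the extract
            return None
        return value


# Adding a class to the removal list yields a different key, so extracts
# cached under the old configuration are no longer served.
old_key = extract_cache_key(28736, ["coordinates"])
new_key = extract_cache_key(28736, ["coordinates", "sortkey"])
print(old_key != new_key)  # True
```

Either measure alone bounds how long a stale extract can survive. Varying the key takes effect immediately on a config change, while a TTL also covers code changes that the key does not encode.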

bmansurov removed bmansurov as the assignee of this task.
bmansurov subscribed.

The immediate problem has been fixed. Feel free to create a separate task considering the second part of T126331#3150048.

> Matching the parser cache's TTL doesn't seem unreasonable.

This was done in rETEX43f3539a7cea: Set an expiry for memcache entries. Thanks, @Legoktm!