
Page previews ignoring text in parentheses breaks some pages
Closed, Resolved (Public)

Description

The page preview (the thing you see when you mouse over a link) shows the first couple of sentences of the linked page, but ignores text in parentheses. Some mathematical pages use parentheses to indicate ordered pairs, and removing them makes the sentence unintelligible. For example, the Wikipedia page https://en.wikipedia.org/wiki/Completely_metrizable_space starts out as follows:

"In mathematics, a completely metrizable space (metrically topologically complete space) is a topological space (X, T) for which there exists at least one metric d on X such that (X, d) is a complete metric space and d induces the topology T."

In the preview this reads "...at least one metric d on X such that is a complete metric space..." which makes no sense.

Event Timeline

Jdlrobson edited projects, added Web-Team-Backlog (Tracking); removed Web-Team-Backlog.
Jdlrobson subscribed.

My understanding is this is a bug in the API. I remember when we specced this out we only kept parentheses without spaces. This one has a space and a single comma. It's possible we could expand the summary endpoint to allow two-word matches with a comma.
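To make the rule described above concrete, here is a minimal sketch in Python. It is not the mobileapps implementation (that service is not written in Python, and its actual regexp is not quoted in this task); the names `KEEP`, `PAREN`, and `strip_parentheticals` are hypothetical. It assumes the rule is: drop a parenthetical if it contains spaces, but keep it if it has no spaces or is exactly two tokens separated by a comma, such as "(X, T)".

```python
import re

# Hypothetical rule: keep "(word)" and "(word, word)"; anything else is dropped.
KEEP = re.compile(r"\([^\s(),]+(?:, ?[^\s(),]+)?\)")
# A non-nested parenthetical together with the whitespace in front of it.
PAREN = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(text: str) -> str:
    """Remove parentheticals unless they match the KEEP pattern."""
    def repl(match: re.Match) -> str:
        group = match.group(0)
        # Compare without the leading whitespace captured by PAREN.
        return group if KEEP.fullmatch(group.strip()) else ""
    return PAREN.sub(repl, text)


lead = ("a completely metrizable space (metrically topologically complete "
        "space) is a topological space (X, T) for which there exists at "
        "least one metric d on X such that (X, d) is a complete metric space")
print(strip_parentheticals(lead))
```

With this assumed rule, the long synonym in parentheses is dropped while the ordered pairs "(X, T)" and "(X, d)" survive, so the sentence from the description stays readable. Nested parentheses and markup inside parentheses (for example refs) are not handled here.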

Pcoombe renamed this task from Page previews ignoring text in parentheses breaks some mathematical pages to Page previews ignoring text in parentheses breaks some pages. Sep 18 2022, 12:13 PM
vadim-kovalenko changed the task status from Open to In Progress. Jan 13 2023, 10:51 AM
vadim-kovalenko claimed this task.
vadim-kovalenko moved this task from Needs Triage to In Progress on the Page Content Service board.

Change 879900 had a related patch set uploaded (by Vadim Kovalenko; author: Vadim Kovalenko):

[mediawiki/services/mobileapps@master] Mobileapps: Page previews ignoring text in parentheses breaks some pages

https://gerrit.wikimedia.org/r/879900

Change 879900 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Mobileapps: Page previews ignoring text in parentheses breaks some pages

https://gerrit.wikimedia.org/r/879900

@vadim-kovalenko I am still seeing the bug when long pressing on iOS (I went to Polish space on EN Wikipedia > long pressed the "completely metrizable" link). We do have an existing client-side bug where we aren't pulling the summary endpoint HTML, however, so that could be why.


@Tsevener a lot of lead paragraphs have some text in parentheses which is truncated. You can check my comment here for more details: https://phabricator.wikimedia.org/T259891#8542852. In your particular case, there is a ref inside the parentheses that isn't filtered by the regexp implemented in this patch. My solution tends to work only with italic letters. There are tons of possible cases where parentheses and their content should be kept for the preview, so let's discuss.

We are going to close this task but it is blocked by https://phabricator.wikimedia.org/T91344 which is still open.

Hey there! I'm not sure if this discussion is exactly the right place to bring up this issue, but it's definitely related. This might provide more context for what I'm talking about. Essentially, previews seem to have instructions to remove any text inside parentheses and replace it with a space [ ] character, and this causes an issue when there is punctuation right after the parentheses. Is this something that can be fixed in this particular situation?
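The spacing problem described in this comment can be illustrated with a short Python sketch. This is an assumption about the behaviour, not the service's code: `remove_parenthetical_naive` is a hypothetical stand-in for "replace the parenthetical with a space", and `tidy_spacing` is one possible cleanup pass that removes whitespace left in front of punctuation.

```python
import re


def remove_parenthetical_naive(text: str) -> str:
    """Replace each non-nested parenthetical with a single space."""
    return re.sub(r"\([^()]*\)", " ", text)


def tidy_spacing(text: str) -> str:
    """Drop whitespace before punctuation and collapse repeated spaces."""
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()


sentence = "The cat (a small mammal), which sleeps."
naive = remove_parenthetical_naive(sentence)
print(repr(naive))                # shows the stray space before the comma
print(repr(tidy_spacing(naive)))  # "The cat, which sleeps."
```

Whether a cleanup pass like this belongs in the summary endpoint or in the clients is a separate question; the sketch only shows that the stray space can be removed after the fact.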

@vadim-kovalenko: Going to close as declined? Or did something get resolved?

@Aklapper I was about to close it as resolved but need agreement from product and web first. There should be a decision on exactly what should be excluded inside parentheses.

Closed this as resolved since this issue has been fixed and the fixes are currently on prod. If there are any related problems, feel free to file a new ticket for them.