
[Bug] Page summaries should not strip the normalized title from the extract?
Closed, Resolved (Public)

Description

Parenthetical stripping discussions have come up before but this may be a special case. When the normalized title contains parentheses and that entire title appears in the extract, it could be preserved instead of stripped. I don't know if this should be a bug, optimization, or working as intended.

This could also be an opportunity to revisit parenthetical stripping behavior. Given that page previews has been enabled by default in production for some time now, editors may already be tweaking lead paragraphs for optimal previews and so parenthetical processing could be toned down or even removed.

Steps to reproduce

  1. Visit https://en.wikipedia.org/wiki/List_of_one-hit_wonders_in_the_United_States.
  2. Hover over the "Brandy (You're a Fine Girl)" link.

Expected results

The content is unstructured, but parenthetical stripping could skip the normalized title, so the preview would show the full title "Brandy (You're a Fine Girl)".

Actual results

Screenshot from 2019-06-22 16-40-23.png (502×762 px, 125 KB)

Environments observed

  • Browser version: Chromium v75.0.3770.90
  • OS version: Ubuntu v19.04
  • Device model: Desktop
  • Device language: English

Check any additional observations

Page summary API response

Response
// https://en.wikipedia.org/api/rest_v1/page/summary/Brandy_(You're_a_Fine_Girl)

{
  "type": "standard",
  "title": "Brandy (You're a Fine Girl)",
  "displaytitle": "Brandy (You're a Fine Girl)",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "wikibase_item": "Q4957221",
  "titles": {
    "canonical": "Brandy_(You're_a_Fine_Girl)",
    "normalized": "Brandy (You're a Fine Girl)",
    "display": "Brandy (You're a Fine Girl)"
  },
  "pageid": 6744625,
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/en/c/cf/Brandy_-_Looking_Glass.jpg",
    "width": 315,
    "height": 315
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/en/c/cf/Brandy_-_Looking_Glass.jpg",
    "width": 315,
    "height": 315
  },
  "lang": "en",
  "dir": "ltr",
  "revision": "896234623",
  "tid": "4ac7b600-7216-11e9-b32a-c62d8f42e7a5",
  "timestamp": "2019-05-09T04:52:43Z",
  "description": "1972 pop song",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.org/wiki/Brandy_(You're_a_Fine_Girl)",
      "revisions": "https://en.wikipedia.org/wiki/Brandy_(You're_a_Fine_Girl)?action=history",
      "edit": "https://en.wikipedia.org/wiki/Brandy_(You're_a_Fine_Girl)?action=edit",
      "talk": "https://en.wikipedia.org/wiki/Talk:Brandy_(You're_a_Fine_Girl)"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.org/wiki/Brandy_(You're_a_Fine_Girl)",
      "revisions": "https://en.m.wikipedia.org/wiki/Special:History/Brandy_(You're_a_Fine_Girl)",
      "edit": "https://en.m.wikipedia.org/wiki/Brandy_(You're_a_Fine_Girl)?action=edit",
      "talk": "https://en.m.wikipedia.org/wiki/Talk:Brandy_(You're_a_Fine_Girl)"
    }
  },
  "api_urls": {
    "summary": "https://en.wikipedia.org/api/rest_v1/page/summary/Brandy_(You're_a_Fine_Girl)",
    "metadata": "https://en.wikipedia.org/api/rest_v1/page/metadata/Brandy_(You're_a_Fine_Girl)",
    "references": "https://en.wikipedia.org/api/rest_v1/page/references/Brandy_(You're_a_Fine_Girl)",
    "media": "https://en.wikipedia.org/api/rest_v1/page/media/Brandy_(You're_a_Fine_Girl)",
    "edit_html": "https://en.wikipedia.org/api/rest_v1/page/html/Brandy_(You're_a_Fine_Girl)",
    "talk_page_html": "https://en.wikipedia.org/api/rest_v1/page/html/Talk:Brandy_(You're_a_Fine_Girl)"
  },
  "extract": "\"Brandy \" is a 1972 song written and composed by Elliot Lurie and recorded by Lurie's band, Looking Glass, on their debut album Looking Glass. The single reached number one on both the Billboard Hot 100 and Cash Box Top 100 charts, remaining in the top position for one week. It reached number two on the former chart for four weeks, stuck behind Gilbert O'Sullivan's \"Alone Again (Naturally)\", before reaching number one, only for \"Brandy\" to be dethroned by \"Alone Again (Naturally)\" the week after. Billboard ranked it as the 12th song of 1972. Horns and strings were arranged by Larry Fallon.",
  "extract_html": "<p>\"<b>Brandy </b>\" is a 1972 song written and composed by Elliot Lurie and recorded by Lurie's band, Looking Glass, on their debut album <i>Looking Glass.</i> The single reached number one on both the <span><i>Billboard</i> Hot 100</span> and <span><i>Cash Box</i> Top 100</span> charts, remaining in the top position for one week. It reached number two on the former chart for four weeks, stuck behind Gilbert O'Sullivan's \"Alone Again (Naturally)\", before reaching number one, only for \"Brandy\" to be dethroned by \"Alone Again (Naturally)\" the week after. <span><i>Billboard</i></span> ranked it as the 12th song of 1972. Horns and strings were arranged by Larry Fallon.</p>"
}
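
A quick way to check a summary response for this problem is to test whether the normalized title still appears verbatim in the extract. The sketch below is only an illustration: the helper name `title_survives` is hypothetical, and the dictionary is a trimmed copy of the response above that keeps just the two fields the check reads.

```python
def title_survives(summary: dict) -> bool:
    """Return True if the normalized title appears verbatim in the plain-text extract."""
    return summary["titles"]["normalized"] in summary["extract"]


# Trimmed copy of the response above (only the fields the check reads).
response = {
    "titles": {"normalized": "Brandy (You're a Fine Girl)"},
    "extract": "\"Brandy \" is a 1972 song written and composed by Elliot Lurie [...]",
}

print(title_survives(response))  # False: the parenthetical part of the title was stripped
```

A plain substring test like this would miss titles that the extract legitimately rephrases, so it is best treated as a rough signal rather than a definitive test.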

Event Timeline

I remember this has been discussed in the past. This preview especially comes to mind:

Screen Shot 2019-06-25 at 10.38.37 AM.png (334×689 px, 192 KB)

At the time, we decided that the change would be too complex, but I'm not sure if we've changed things since then.

LGoto raised the priority of this task from Medium to Needs Triage. Jun 26 2019, 3:41 PM

One idea, just to throw out there, would be to check if the title has any parentheses. If so we could just skip stripping any parentheses. I think that could be an easy check to make. The downside would be that the extract may have parentheses we might have removed otherwise.
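
A minimal sketch of that first idea, assuming a simplified stand-in for the stripping step. The regular expression and the function names `strip_parentheticals` and `summarize` are hypothetical and are not taken from the service:

```python
import re

# Hypothetical, simplified stand-in for the service's parenthetical stripping:
# removes each non-nested "(...)" group together with the whitespace before it.
_PAREN = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(text: str) -> str:
    return _PAREN.sub("", text)


def summarize(title: str, lead: str) -> str:
    """Skip parenthetical stripping entirely when the title itself has parentheses."""
    if "(" in title and ")" in title:
        return lead
    return strip_parentheticals(lead)


print(summarize("Paris", "Paris (French: Paris) is the capital."))
# Paris is the capital.
print(summarize("Brandy (You're a Fine Girl)",
                "\"Brandy (You're a Fine Girl)\" is a 1972 song (a hit)."))
# "Brandy (You're a Fine Girl)" is a 1972 song (a hit).
```

The second call shows the downside mentioned above: the unrelated "(a hit)" parenthetical is kept as well.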

Another idea is to replace the title with a tag/markup before performing any transformations and then re-add the title and remove the tag/markup, for example:

  1. Get the HTML text:
<p>"<b>Brandy (You're a Fine Girl)</b>" is a 1972 song written and composed by Elliot Lurie and recorded by Lurie's band, Looking Glass, on their debut album <i>Looking Glass.</i> [...]
  2. Replace the title in the HTML text before the summarize transformation:
<p>"<b><post-process-title></b>" is a 1972 song written and composed by Elliot Lurie and recorded by Lurie's band, Looking Glass, on their debut album <i>Looking Glass.</i> [...]
  3. Re-add the title after the summarize transformation, replacing <post-process-title>:
<p>"<b>Brandy (You're a Fine Girl)</b>" is a 1972 song written and composed by Elliot Lurie and recorded by Lurie's band, Looking Glass, on their debut album <i>Looking Glass.</i> [...]
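
The three steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the service's code: `strip_parentheticals` is a hypothetical, simplified stand-in for the summarize transformation, and the sketch assumes the title appears in the HTML exactly as written (no entity-escaped apostrophes, for instance).

```python
import re

PLACEHOLDER = "<post-process-title>"  # tag name taken from the steps above

# Hypothetical, simplified stand-in for the summarize transformation.
_PAREN = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(html: str) -> str:
    return _PAREN.sub("", html)


def summarize_preserving_title(title: str, html: str) -> str:
    protected = html.replace(title, PLACEHOLDER)   # step 2: hide the title
    stripped = strip_parentheticals(protected)     # run the transformation
    return stripped.replace(PLACEHOLDER, title)    # step 3: restore the title


html = "<p>\"<b>Brandy (You're a Fine Girl)</b>\" is a 1972 song (pop) by Looking Glass.</p>"
print(summarize_preserving_title("Brandy (You're a Fine Girl)", html))
# <p>"<b>Brandy (You're a Fine Girl)</b>" is a 1972 song by Looking Glass.</p>
```

Unlike skipping the stripping altogether, this keeps the title intact while still removing other parentheticals such as "(pop)".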

LGoto raised the priority of this task from Medium to High. Aug 21 2019, 3:52 PM

We're focused on other goals right now for Q1 but this is important and should be looked into.

We would be happy to review patches to the service if anyone from the web team has some cycles before we get to it; it may take some time.

LGoto lowered the priority of this task from High to Medium. Aug 28 2019, 3:54 PM
vadim-kovalenko changed the task status from Open to In Progress. Jan 18 2023, 6:39 PM
vadim-kovalenko claimed this task.
vadim-kovalenko moved this task from Backlog to In Progress on the Content-Transform-Team-WIP board.

Change 881384 had a related patch set uploaded (by Vadim Kovalenko; author: Vadim Kovalenko):

[mediawiki/services/mobileapps@master] Mobileapps: Page summaries should not strip the normalized title from the extract

https://gerrit.wikimedia.org/r/881384

This bug usually happens with articles that have a music album or song name in the lead paragraph. I found a few more examples of the bug:

I applied a regexp pattern to mitigate this issue. It matches text that meets these conditions:

  1. Text is inside <b> tag.
  2. Text contains parentheses.
  3. Text might contain single quotes and spaces.

The regexp does not match digits. Titles with digits are a fairly rare case, and handling them would make the regexp more expensive to evaluate, so I decided to omit them (for now).
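
For illustration only, a pattern meeting the three conditions above might look like the sketch below. The actual regexp in the patch is not quoted in this task, so `BOLD_PAREN_TITLE` and `find_protected_titles` are assumptions rather than the merged code.

```python
import re

# Assumed pattern: a <b>...</b> span whose text is letters, single quotes and
# spaces, and which contains one parenthesized group. Digits are not matched.
BOLD_PAREN_TITLE = re.compile(r"<b>[A-Za-z' ]*\([A-Za-z' ]+\)[A-Za-z' ]*</b>")


def find_protected_titles(html: str) -> list:
    """Return the bold spans that should be exempt from parenthetical stripping."""
    return BOLD_PAREN_TITLE.findall(html)


print(find_protected_titles(
    "<p>\"<b>Brandy (You're a Fine Girl)</b>\" is a 1972 song (pop).</p>"
))
# ["<b>Brandy (You're a Fine Girl)</b>"]
print(find_protected_titles("<p><b>Song (1999 Remix)</b> is a single.</p>"))
# []
```

The second call shows the limitation described above: a bold title containing digits is not recognized, so its parenthetical would still be stripped.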

Change 881384 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Mobileapps: Page summaries should not strip the normalized title from the extract

https://gerrit.wikimedia.org/r/881384

Still seeing the bug on prod (but perhaps it's not deployed yet?)

IMG_E945F45A8F92-1.jpeg (2×1 px, 1 MB)

Since this is against production data, I'm not really sure how to confirm this against the beta cluster. https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Brandy_(You're_a_Fine_Girl) fails for me. https://en.wikipedia.org/api/rest_v1/page/summary/Brandy_(You're_a_Fine_Girl) is still stripping the parentheses.

It seems that this hasn't been deployed to prod yet. A local instance works for me; please check: http://localhost:8888/en.wikipedia.org/v1/page/summary/Brandy_(You're_a_Fine_Girl)

extract_html: "<p>\"<b>Brandy (You're a Fine Girl)</b>\" is a 1972 song by American pop rock band Looking Glass from their debut album, <i>Looking Glass</i>. It was written by Looking Glass lead guitarist and co-vocalist Elliot Lurie.</p>"

@vadim-kovalenko I tested it locally, and interestingly it looks like iOS also strips out parentheses client side (here). I'm not opposed to modifying it to match, or removing it entirely so that we only have parentheses logic in one place, server-side. Let's discuss this and T263932 in our next sync and decide.