
Review exclude all approach to parenthetical elements in summary endpoint
Open, Stalled, LowestPublic

Description

Currently we scrub all parentheticals in the page summary endpoint. This task is about reconsidering and changing this approach (a nice summary of the current state is here: https://phabricator.wikimedia.org/T91344#4462979).

Identify specific parenthetical elements to exclude from Hovercards, e.g. class="IPA" for pronunciation templates, to let Hovercards be more specific in its content filtering.

This is a followup from the decision to remove all parenthetical content - which was originally done because a few types of content can often take up a lot of characters in the first sentence, which made the hovercard/extract a lot less useful. Those types are:

This aspect has been discussed in a few topics, such as:

and originally removed in T67138: Hovercards: Fix the bracket removal code (and followup in T69225: Hovercards: Space before removed parentheses should also be removed in the extract).

We'd like to do it more precisely/judiciously, but without negatively affecting other re-users of extracts, so we can't just re-use the class="noexcerpt" (e.g. template:IPAc-en) or class="nopopups" (e.g. template:H:IPA) without deeply understanding what those classes already cover.


NOTE: Currently, renderer.article requests TextExtracts for plaintext excerpts and removes everything within brackets.
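As a rough illustration of what "removes everything within brackets" amounts to, here is a minimal Python sketch of balanced-parenthetical stripping. It is not the actual renderer.article or TextExtracts code; the name strip_parentheticals is hypothetical. It repeatedly removes the innermost (...) group so that nested groups are handled, and it also drops a single space before each removed group (the issue described in T69225). Unbalanced parentheses are left untouched.

```python
import re

# One innermost "(...)" group (no parentheses inside it), plus a single
# preceding space if there is one.
_INNERMOST_GROUP = re.compile(r" ?\([^()]*\)")


def strip_parentheticals(text: str) -> str:
    """Remove all balanced parenthetical groups, including nested ones.

    Hypothetical helper for illustration only.
    """
    previous = None
    # Each pass removes the innermost groups; repeat until nothing changes.
    while previous != text:
        previous = text
        text = _INNERMOST_GROUP.sub("", text)
    return text


print(strip_parentheticals("Example Person (born 1 January 1900) was a writer."))
```

Note that a blanket rule like this cannot tell a pronunciation guide from a birth date: both are removed the same way.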

Can anyone help us map out the existing uses of class="noexcerpt" and class="nopopups" ?
Or suggest other ideas?

Event Timeline

TheDJ added a subscriber: TheDJ.Mar 6 2015, 8:56 PM

I note that birth/death dates were truly missed by users when filtered out. This is information that a lot of people expect to see in the first sentence (unlike ipa, etymology).

@TheDJ this is work that we plan on doing post launch as an incremental improvement, if you or anyone else want to start the work of collecting the list of what to show and what not to, we will hopefully get around to an update that includes it.

Qgil added a subscriber: Qgil.Jun 27 2015, 9:01 PM

For what it's worth, this has been identified as one of the main problems of Hovercards at https://ca.wikipedia.org/wiki/Tema:Sjzjt2dlgwr6c8o3.

Among all the parenthesised content in first sentence of articles, it's important not to exclude date and place of birth and death, since they give a lot of context - usually as much as the whole first sentence. Excluding pronunciation and etymology to leave space for first sentences is OK.

Quiddity triaged this task as High priority.Jun 30 2015, 10:31 PM
Jdlrobson lowered the priority of this task from High to Normal.Nov 2 2015, 11:00 PM
Jdlrobson added a subscriber: Jdlrobson.

Personally, I like having all parenthetical phrases removed in the previews, even dates. It makes the previews easy to read and easy to quickly figure out what the subject is. Also, it seems like it would be very difficult (impossible?) to reliably strip everything but the dates across different languages. Take these example first sentences from English Wikipedia:

Wu Zetian (simplified Chinese: 武则天; traditional Chinese: 武則天; pinyin: Wǔ Zétiān; Wade–Giles: Wu3 Tse2-t'ien1, February 17, 624 – December 16, 705), also known as Wu Zhao (Wu Chao; Chinese: 武曌; pinyin: Wǔ Zhào; Wade–Giles: Wu3 Chao4), Wu Hou (Chinese: 武后; pinyin: Wǔ Hòu; Wade–Giles: Wu3 Hou4) and during the later Tang dynasty as Tian Hou (天后), referred to in English as Empress Consort Wu or by the deprecated term "Empress Wu", was a Chinese sovereign who ruled unofficially as Empress and later, officially as Emperor of China (皇帝) during the brief Zhou dynasty (周, 690-705), which interrupted the Tang dynasty (618–690 & 705–907).

or

Caroline Fourest (born September 19, 1975 in Aix-en-Provence) is a French feminist writer, documentary director, journalist, columnist (every Saturday in Le Monde until 14 July 2012), radio presenter (every Monday on the France Culture public radio channel), editor of the magazine ProChoix ("antiracist" and "secularist"), and author of Frère Tariq ("Brother Tariq"), a critical look at the works of Muslim intellectual Tariq Ramadan.

(notice the 2nd parenthetical phrase in the example above).

I note that birth/death dates were truly missed by users when filtered out. This is information that a lot of people expect to see in the first sentence (unlike ipa, etymology).

Among all the parenthesised content in first sentence of articles, it's important not to exclude date and place of birth and death, since they give a lot of context

  • make sure not to strip dates & places
Qgil removed a subscriber: Qgil.May 23 2016, 8:40 AM

@ovasileva @bmansurov Per grooming: this could probably use a formal spike.

kaldari added a comment.EditedMay 1 2017, 6:29 PM

Here are some of the things you would need to exclude, most of which are not programmatically excludable except by removing all parenthetical phrases:

  • birth and death dates (debatable)
  • birth and death dates in alternate calendars
  • birth and death locations
  • alternate names (debatable)
  • maiden names (debatable)
  • foreign names
  • pronunciations
  • foreign pronunciations
  • transliterations
  • etymology
  • minor-planet designation
  • etc.

If you want to see my take on why this task should be declined (in favor of stripping all parenthetical phrases), please read https://en.wikipedia.org/wiki/User:Kaldari/Wikipedia's_lead_sentence_problem.

Quiddity added a comment.EditedMay 5 2017, 1:31 AM

As TheDJ wrote in the first comment, and as I wrote elsewhere (many times), we really want to retain some of the parenthetical data. There is some agreement that IPA/pronunciation can probably be excluded, but other details that appear in brackets in the first few sentences can be very important.

If you want to see my take on why this task should be declined (in favor of stripping all parenthetical phrases), please read https://en.wikipedia.org/wiki/User:Kaldari/Lead_sentences_have_cancer.

I disagree we should automatically exclude all parenthetical content, just because some article examples are bad. I do agree it is a widespread problem, particularly with place names in non-English countries (at the English Wikipedia), but that is not the best way to handle it, for this purpose.

Otherwise, we completely screw up articles like https://en.wikipedia.org/wiki/Birdman_(film) - and make other articles confusing like https://en.wikipedia.org/wiki/RAF_Hospital_Wegberg - and remove highly relevant information about common names from some zoological articles like https://en.wikipedia.org/wiki/Bothrops_campbelli - and etc.
In particular, many users, especially at Dewiki and Enwiki, were very unhappy to have life-span dates automatically stripped out of biographies, as that is often crucial contextual information (is this an 18th century author or 21st century author?!).
Additionally, it is misinformative to omit that a transliteration of someone's name is not actually their name. The real name of https://en.wikipedia.org/wiki/Jigme_Khesar_Namgyel_Wangchuck is འཇིགས་མེད་གེ་སར་རྣམ་རྒྱལ་དབང་ཕྱུག་, and readers/editors should be reminded of that, not be left thinking that the latin script rules all things.

MaxSem added a subscriber: MaxSem.May 5 2017, 1:41 AM

John Doe (born May 4, 2017) is a Nullislandian who was convicted of eating babies (falsely).

Try throwing parentheses out of this.

The Android app folks went through something very similar in 2015: T91792: As a user, I'd like lead sentences to be concise so I can get an overview of the topic I'm reading about. See, e.g., @Krinkle's thoughts and examples there: T91792#1267662.

I really think this is best left to editors (see https://en.wikipedia.org/wiki/User:Kaldari/Wikipedia's_lead_sentence_problem). I tend to agree with @kaldari that this should be declined and we should encourage editors to use the "noexcerpt" class (using global wgExtractsRemoveClasses and documented here: https://www.mediawiki.org/wiki/Extension:Popups#How_can_I_remove_content_from_a_page_preview.3F).

[...] I tend to agree with @kaldari that this should be declined and we should encourage editors to use the "noexcerpt" class [...]

I agree with that option very strongly, but note that @kaldari specifically wrote:

[...] this task should be declined (in favor of stripping all parenthetical phrases), [...]

Whereas I think (IIUC) that you are suggesting we should decline this task, but also reinstate the content which is currently hidden merely because it is within parentheses.

That is my (and most editors') top-most preference, but it was previously declared off-the-table. It will look odd in a few instances, but it will avoid misinformation & confusion & gross omission.

Note that the IPA templates on Enwiki are all that seem to use the class=noexcerpt so far: insource search - but they're also the only element that has widespread agreement as not-very-helpful.

ovasileva changed the task status from Open to Stalled.Jun 7 2017, 6:51 PM
ovasileva lowered the priority of this task from Normal to Low.

For current implementation of page previews, we will be stripping all parentheticals. We can identify edge cases, etc, during iteration.

If this happens it will not happen in TextExtracts but the new REST endpoint planned here: T113094

For current implementation of page previews, we will be stripping all parentheticals. We can identify edge cases, etc, during iteration.

I found our first edge case and I don't see how we can fix that: https://www.mediawiki.org/wiki/Topic:Tvv8ff8h44bt74t8
Removing them still seems like a very bad idea to me.

MaxSem removed a subscriber: MaxSem.Aug 8 2017, 8:08 PM
Krinkle removed a subscriber: Krinkle.Aug 16 2017, 12:48 AM

How is this task a duplicate of T168848? (The task description there only says "Strip balanced parentheticals", without identifying specific parenthetical elements.)

The decision was made to strip ALL balanced parentheticals (see T168848 description). We will not be excluding any.

Restricted Application added a subscriber: jeblad.Aug 25 2017, 8:37 PM

I still strongly advise against stripping everything, due to the many drawbacks identified here and elsewhere, especially regarding birth and death dates, plus the various edge-cases where it causes actual problems.
Editors and readers (potential editors) won't see the problem with long parentheticals if we hide them.
I suggest leaving this task open, tweaking the description however necessary, and setting to lowest priority, so that it can be easily found and discussed whilst the current experiment with stripping all balanced parentheticals is deployed.

ovasileva reopened this task as Stalled.Aug 28 2017, 10:17 AM
ovasileva lowered the priority of this task from Low to Lowest.

I like @Quiddity's suggestion. For now, for the majority of cases, they are detrimental to the reading experience and we will be removing them. That said, I think it's good to keep this task around for more related conversation. Our hypothesis from this end is that, as content in first paragraphs improves over time (on a very, very long time scale), we will be able to eventually add parentheticals back in.

I like @Quiddity's suggestion. For now, for the majority of cases, they are detrimental to the reading experience and we will be removing them.

I'm not sure that I follow.

I think @Quiddity was suggesting that we don't strip any parenthetical statements because of the number of edge cases and letting readers and editors decide whether they're detrimental to the preview reading experience. Is this a correct reading of T91344#3554418?

By lowering the priority of this task, are we saying that:

  • We're not going to strip any parenthetical statements; or
  • We're going to strip all parenthetical statements and slowly allow certain classes of statement back in (i.e. create a whitelist); or
  • …?

I suggest leaving this task open, tweaking the description however necessary, and setting to lowest priority, so that it can be easily found and discussed whilst the current experiment with stripping all balanced parentheticals is deployed.

@phuedx - sorry, that was unclear! I was referring to the above suggestion. We will be stripping all parentheses for now, but leave this task open to continue capturing ideas on improving parenthetical stripping for a future/second iteration.

I've linked to this task here: https://www.mediawiki.org/wiki/Extension:Popups#Why_are_parenthetical_stripped.3F
It seems a decision was made, but it's open to change, and we can continue talking about that here without fear of losing this task.

Just in case it's overlooked at the bottom of the Description, I'll repeat:
I think it would be really helpful (both for this product, and probably many other areas), to have a good explanation somewhere about the existing uses of class="noexcerpt" and class="nopopups". I should've also added class="nomobile" and class="noprint" to that list. I.e.

  • where the classes are defined in the code/CSS,
  • where they are currently used in the page-contents across some example wikis (e.g. Enwiki: noexcerpt, nopopups, nomobile, noprint - and if you add Article-namespace to those searches there are more results),
  • and how editors are meant to use them.

If that sounds useful to you, please update the task Description however would be helpful, or link to any existing docs/research if you know of any, or fork into another task, or etc. :-)

@Quiddity these are documented here, no?: https://www.mediawiki.org/wiki/Extension:Popups#FAQ
nopopups is not supported and noprint and nomobile are not intended for usage with Popups.

@Quiddity these are documented here, no?: https://www.mediawiki.org/wiki/Extension:Popups#FAQ
nopopups is not supported

But it is used by Enwiki for IPA templates, which are much discussed in this task, (and used by Nlwiki for their "citation needed" and "when" inline templates), so I thought it might be pertinent to understand how and where it is used, in order to not duplicate any efforts.

and noprint and nomobile are not intended for usage with Popups.

I saw those listed in https://gerrit.wikimedia.org/g/mediawiki/extensions/TextExtracts/+/master/extension.json#52 hence I mentioned.

I saw those listed in https://gerrit.wikimedia.org/g/mediawiki/extensions/TextExtracts/+/master/extension.json#52 hence I mentioned.

Page previews no longer uses TextExtracts so this doesn't apply. I've clarified this in the FAQ. TBH TextExtracts shouldn't be scrubbing those. They are for 2 different purposes - 1) hiding content on mobile and 2) hiding content in print mode. "noexcerpt" is more explicit.

But it is used by Enwiki for IPA templates, which are much discussed in this task, (and used by Nlwiki for their "citation needed" and "when" inline templates), so I thought it might be pertinent to understand how and where it is used, in order to not duplicate any efforts.

I believe "nopopups" is used by Navigation popups to disable showing previews on certain links. We don't support that on page previews, but potentially could as a solution to T169370.

@bearND, @Mholloway: Does the Page Summary API support removal of <$tagName class="noexcerpt"> elements as laid out in T91344#4036824?

@bearND, @Mholloway: Does the Page Summary API support removal of <$tagName class="noexcerpt"> elements as laid out in T91344#4036824?

Yes, .noexcerpt elements (among others) are already stripped in the course of generating a summary.

https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/transformations/summarize.js#L39
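To make the class-based stripping concrete, the following is a minimal Python sketch of removing elements by class from an HTML fragment. It is not the summarize.js implementation linked above; ClassStripper and strip_classes are hypothetical names, the default class list contains only "noexcerpt" (the real service strips others as well), and the input is assumed to be well-formed HTML.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "br", "col", "embed", "hr", "img", "input", "link", "meta", "wbr"}


class ClassStripper(HTMLParser):
    """Re-emit HTML while dropping elements that carry a listed class."""

    def __init__(self, remove_classes):
        super().__init__(convert_charrefs=True)
        self.remove_classes = set(remove_classes)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed element

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        removed = self.skip_depth > 0 or bool(classes & self.remove_classes)
        if tag in VOID_TAGS:
            if not removed:
                self.out.append(self.get_starttag_text())
            return
        if removed:
            self.skip_depth += 1  # track nesting inside the removed subtree
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth > 0:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Text arrives unescaped, so escape it again on output.
            self.out.append(escape(data, quote=False))


def strip_classes(html_text, remove_classes=("noexcerpt",)):
    parser = ClassStripper(remove_classes)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(strip_classes('<p><b>Foo</b> <span class="IPA noexcerpt">/fu/</span>is a thing.</p>'))
```

Unlike blanket parenthetical removal, this approach only drops what editors or templates have explicitly marked, which is why the class has to be applied consistently for it to help.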

I'm a little confused. This is done, right?

I'm a little confused. This is done, right?

I believe we're still stripping all content within parentheses, (except where those parentheses are connected to another string, e.g. the RAF_Hospital_Wegberg instance is now fixed), so this task hasn't really changed status.

All of the articles I've glance-tested still have the problems discussed above - missing date of birth, missing species' common-name, missing parts of proper title (e.g. for Birdman), etc.

See also T91344#3561474 above, and your reply to that. I don't think anything has changed since then. Here, we're still (ideally) collecting notes on Examples of problems, and Options/Ideas to resolve them.

Jdlrobson renamed this task from Identify specific parenthetical elements to exclude from Hovercards to Review exclude all approach to parenthetical elements.Jul 31 2018, 2:46 AM
Jdlrobson renamed this task from Review exclude all approach to parenthetical elements to Review exclude all approach to parenthetical elements in summary endpoint.
Jdlrobson removed a project: Page-Previews.
Jdlrobson updated the task description.

Parentheses are not stripped from East Asian languages because different characters are used

This has also been fixed. I wrote the code. If not I'd need some examples. If so can we remove this from description?
I've updated the description to make the current state clear.
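For readers unfamiliar with the East Asian case: Chinese and Japanese text commonly uses the full-width parentheses U+FF08 and U+FF09 rather than the ASCII pair, so a pattern written only for "(" and ")" never matches. The Python sketch below shows one way to cover both pairs. It is an assumption-laden illustration, not the fix referred to above; PAREN_PAIRS and strip_all_parentheticals are hypothetical names.

```python
import re

# Bracket pairs treated as parentheses. The second pair is the full-width
# form (U+FF08 / U+FF09) used in Chinese and Japanese text.
PAREN_PAIRS = [("(", ")"), ("\uFF08", "\uFF09")]

# One pattern per pair: an innermost group, plus one preceding ASCII or
# ideographic (U+3000) space if present.
_PATTERNS = [
    re.compile(
        "[ \u3000]?" + re.escape(open_ch)
        + "[^" + re.escape(open_ch) + re.escape(close_ch) + "]*"
        + re.escape(close_ch)
    )
    for open_ch, close_ch in PAREN_PAIRS
]


def strip_all_parentheticals(text: str) -> str:
    """Remove balanced groups for every pair in PAREN_PAIRS (illustrative)."""
    previous = None
    # Repeat so that nested groups are removed from the inside out.
    while previous != text:
        previous = text
        for pattern in _PATTERNS:
            text = pattern.sub("", text)
    return text


print(strip_all_parentheticals("東京（とうきょう）は日本の首都"))
```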