
Review exclude all approach to parenthetical elements in summary endpoint
Open, Stalled, Lowest, Public

Description

Currently we scrub all parentheticals in the page summary endpoint. This task is about reconsidering and changing this approach (a good summary of the current state is here: https://phabricator.wikimedia.org/T91344#4462979).

Identify specific parenthetical elements to exclude from Hovercards, e.g. class="IPA" for pronunciation templates, to let Hovercards be more specific in its content filtering.

This is a followup from the decision to remove all parenthetical content - which was originally done because a few types of content can often take up a lot of characters in the first sentence, which made the hovercard/extract a lot less useful. Those types are:

This aspect has been discussed in a few topics, such as:

and originally removed in T67138: Hovercards: Fix the bracket removal code (and followup in T69225: Hovercards: Space before removed parentheses should also be removed in the extract).

We'd like to do it more precisely/judiciously, but without negatively affecting other re-users of extracts... - so we can't just re-use the class="noexcerpt" (e.g. template:IPAc-en) or class="nopopups" (e.g. template:H:IPA) without deeply understanding what those classes already cover.


NOTE: Currently, renderer.article requests TextExtracts for plaintext excerpts and removes everything within brackets.
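For readers unfamiliar with what "removes everything within brackets" means in practice, here is a minimal Python sketch. It is not the code used by renderer.article or by any Wikimedia service. The function name `strip_parentheticals` and the sample sentence are made up for illustration. It removes balanced groups from the inside out and also drops the whitespace before each group (the issue raised in T69225).

```python
import re

# Matches one innermost "( ... )" group plus any whitespace right before it.
INNERMOST = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(text: str) -> str:
    """Remove all balanced parenthetical groups, innermost first.

    Unbalanced parentheses are left untouched because the pattern only
    matches a "(" that has a matching ")".
    """
    while True:
        text, count = INNERMOST.subn("", text)
        if count == 0:
            return text


if __name__ == "__main__":
    sample = "Jane Roe (born 1 January 1900 (disputed)) was a writer."
    print(strip_parentheticals(sample))  # Jane Roe was a writer.
```

Note that a blanket rule like this cannot tell a pronunciation guide from a birth date, which is the core complaint in this task.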

Can anyone help us map out the existing uses of class="noexcerpt" and class="nopopups" ?
Or suggest other ideas?


Event Timeline


Personally, I like having all parenthetical phrases removed in the previews, even dates. It makes the previews easy to read and easy to quickly figure out what the subject is. Also, it seems like it would be very difficult (impossible?) to reliably strip everything but the dates across different languages. Take these example first sentences from English Wikipedia:

Wu Zetian (simplified Chinese: 武则天; traditional Chinese: 武則天; pinyin: Wǔ Zétiān; Wade–Giles: Wu3 Tse2-t'ien1, February 17, 624 – December 16, 705), also known as Wu Zhao (Wu Chao; Chinese: 武曌; pinyin: Wǔ Zhào; Wade–Giles: Wu3 Chao4), Wu Hou (Chinese: 武后; pinyin: Wǔ Hòu; Wade–Giles: Wu3 Hou4) and during the later Tang dynasty as Tian Hou (天后), referred to in English as Empress Consort Wu or by the deprecated term "Empress Wu", was a Chinese sovereign who ruled unofficially as Empress and later, officially as Emperor of China (皇帝) during the brief Zhou dynasty (周, 690-705), which interrupted the Tang dynasty (618–690 & 705–907).

or

Caroline Fourest (born September 19, 1975 in Aix-en-Provence) is a French feminist writer, documentary director, journalist, columnist (every Saturday in Le Monde until 14 July 2012), radio presenter (every Monday on the France Culture public radio channel), editor of the magazine ProChoix ("antiracist" and "secularist"), and author of Frère Tariq ("Brother Tariq"), a critical look at the works of Muslim intellectual Tariq Ramadan.

(notice the 2nd parenthetical phrase in the example above).

I note that users really missed birth/death dates when they were filtered out. This is information that a lot of people expect to see in the first sentence (unlike IPA or etymology).

Among all the parenthesised content in the first sentence of articles, it's important not to exclude the date and place of birth and death, since they give a lot of context.

  • make sure not to strip dates & places

Here are some of the things you would need to exclude, most of which are not programmatically excludable except by removing all parenthetical phrases:

  • birth and death dates (debatable)
  • birth and death dates in alternate calendars
  • birth and death locations
  • alternate names (debatable)
  • maiden names (debatable)
  • foreign names
  • pronunciations
  • foreign pronunciations
  • transliterations
  • etymology
  • minor-planet designation
  • etc.

If you want to see my take on why this task should be declined (in favor of stripping all parenthetical phrases), please read https://en.wikipedia.org/wiki/User:Kaldari/Wikipedia's_lead_sentence_problem.

As TheDJ wrote in the first comment, and as I wrote elsewhere (many times), we really want to retain some of the parenthetical data. There is some agreement that IPA/pronunciation can probably be excluded, but other details that appear in brackets in the first few sentences can be very important.

If you want to see my take on why this task should be declined (in favor of stripping all parenthetical phrases), please read https://en.wikipedia.org/wiki/User:Kaldari/Lead_sentences_have_cancer.

I disagree we should automatically exclude all parenthetical content, just because some article examples are bad. I do agree it is a widespread problem, particularly with place names in non-English countries (at the English Wikipedia), but that is not the best way to handle it, for this purpose.

Otherwise, we completely screw up articles like https://en.wikipedia.org/wiki/Birdman_(film) - and make other articles confusing like https://en.wikipedia.org/wiki/RAF_Hospital_Wegberg - and remove highly relevant information about common names from some zoological articles like https://en.wikipedia.org/wiki/Bothrops_campbelli - and etc.
In particular, many users, especially at Dewiki and Enwiki, were very unhappy to have life-span dates automatically stripped out of biographies, as that is often crucial contextual information (is this an 18th century author or 21st century author?!).
Additionally, it is misinformative to omit that a transliteration of someone's name is not actually their name. The real name of https://en.wikipedia.org/wiki/Jigme_Khesar_Namgyel_Wangchuck is འཇིགས་མེད་གེ་སར་རྣམ་རྒྱལ་དབང་ཕྱུག་, and readers/editors should be reminded of that, not be left thinking that the Latin script rules all things.

John Doe (born May 4, 2017) is a Nullislandian who was convicted of eating babies (falsely).

Try throwing parentheses out of this.

I really think this is best left to editors: https://en.wikipedia.org/wiki/User:Kaldari/Wikipedia's_lead_sentence_problem. I tend to agree with @kaldari that this should be declined and we should encourage editors to use the "noexcerpt" class (using the global wgExtractsRemoveClasses and documented here: https://www.mediawiki.org/wiki/Extension:Popups#How_can_I_remove_content_from_a_page_preview.3F).

[...] I tend to agree with @kaldari that this should be declined and we should encourage editors to use the "noexcerpt" class [...]

I agree with that option very strongly, but note that @kaldari specifically wrote:

[...] this task should be declined (in favor of stripping all parenthetical phrases), [...]

Whereas I think (IIUC) that you are suggesting we should decline this task, but also reinstate the content which is currently hidden merely because it is within parentheses.

That is my (and most editors') topmost preference, but it was previously declared off the table. It will look odd in a few instances, but it will avoid misinformation, confusion, and gross omission.

Note that the IPA templates on Enwiki are all that seem to use the class=noexcerpt so far: insource search - but they're also the only element that has widespread agreement as not-very-helpful.

ovasileva changed the task status from Open to Stalled. Jun 7 2017, 6:51 PM
ovasileva lowered the priority of this task from Medium to Low.

For current implementation of page previews, we will be stripping all parentheticals. We can identify edge cases, etc, during iteration.

If this happens it will not happen in TextExtracts but the new REST endpoint planned here: T113094

For current implementation of page previews, we will be stripping all parentheticals. We can identify edge cases, etc, during iteration.

I found our first edge case and I don't see how we can fix that: https://www.mediawiki.org/wiki/Topic:Tvv8ff8h44bt74t8
Removing them still seems like a very bad idea to me.

How is this task a duplicate of T168848? (The task description there only says "Strip balanced parentheticals", without identifying specific parenthetical elements.)

The decision was made to strip ALL balanced parentheticals (see T168848 description). We will not be excluding any.

I still strongly advise against stripping everything, due to the many drawbacks identified here and elsewhere, especially regarding birth and death dates, plus the various edge-cases where it causes actual problems.
Editors and readers (potential editors) won't see the problem with long parentheticals if we hide them.
I suggest leaving this task open, tweaking the description however necessary, and setting to lowest priority, so that it can be easily found and discussed whilst the current experiment with stripping all balanced parentheticals is deployed.

ovasileva lowered the priority of this task from Low to Lowest.

I like @Quiddity's suggestion. For now, for the majority of cases, they are detrimental to the reading experience and we will be removing them. That said, I think it's good to keep this task around for more related conversation. Our hypothesis from this end is that, as content in first paragraphs improves over time (on a very, very long time scale), we will be able to eventually add parentheticals back in.

I like @Quiddity's suggestion. For now, for the majority of cases, they are detrimental to the reading experience and we will be removing them.

I'm not sure that I follow.

I think @Quiddity was suggesting that we don't strip any parenthetical statements, because of the number of edge cases, and that we let readers and editors decide whether they're detrimental to the preview reading experience. Is this a correct reading of T91344#3554418?

By lowering the priority of this task, are we saying that:

  • We're not going to strip any parenthetical statements; or
  • We're going to strip all parenthetical statements and slowly allow certain classes of statement back in (i.e. create a whitelist); or
  • …?
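To make the second option (an allowlist) concrete, here is a rough Python sketch. It keeps a parenthetical only when it contains something that looks like a 3- or 4-digit year and strips the rest. The rule, the name `strip_except_dates`, and the shortened sample sentence are assumptions for illustration. The rule depends on Western digits and handles only non-nested groups, so it also shows the cross-language difficulty raised earlier in this thread.

```python
import re

# One non-nested "( ... )" group plus the whitespace before it.
PAREN = re.compile(r"\s*\(([^()]*)\)")
# Crude "looks like a year" test: a standalone 3- or 4-digit number.
YEAR = re.compile(r"\b\d{3,4}\b")


def strip_except_dates(text: str) -> str:
    """Strip parentheticals unless their content contains a year-like number."""

    def replace(match: re.Match) -> str:
        inner = match.group(1)
        # Keep the whole match (including leading space) when a year is found.
        return match.group(0) if YEAR.search(inner) else ""

    return PAREN.sub(replace, text)


if __name__ == "__main__":
    sample = (
        "Caroline Fourest (born September 19, 1975 in Aix-en-Provence) is a "
        'French writer and author of Frère Tariq ("Brother Tariq").'
    )
    print(strip_except_dates(sample))
```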

I suggest leaving this task open, tweaking the description however necessary, and setting to lowest priority, so that it can be easily found and discussed whilst the current experiment with stripping all balanced parentheticals is deployed.

@phuedx - sorry, that was unclear! I was referring to the above suggestion. We will be stripping all parentheses for now, but will leave this task open to continue capturing ideas on improving parenthetical stripping for a future/second iteration.

I've linked to this task here: https://www.mediawiki.org/wiki/Extension:Popups#Why_are_parenthetical_stripped.3F
It seems a decision was made, but it's open to change and we can continue talking about that here without fear of losing this task.

Just in case it's overlooked at the bottom of the Description, I'll repeat:
I think it would be really helpful (both for this product, and probably many other areas), to have a good explanation somewhere about the existing uses of class="noexcerpt" and class="nopopups". I should've also added class="nomobile" and class="noprint" to that list. I.e.

  • where the classes are defined in the code/CSS,
  • where they are currently used in the page-contents across some example wikis (e.g. Enwiki: noexcerpt, nopopups, nomobile, noprint - and if you add Article-namespace to those searches there are more results),
  • and how editors are meant to use them.

If that sounds useful to you, please update the task Description however would be helpful, or link to any existing docs/research if you know of any, or fork into another task, or etc. :-)

@Quiddity these are documented here, no?: https://www.mediawiki.org/wiki/Extension:Popups#FAQ
nopopups is not supported and noprint and nomobile are not intended for usage with Popups.

@Quiddity these are documented here, no?: https://www.mediawiki.org/wiki/Extension:Popups#FAQ
nopopups is not supported

But it is used by Enwiki for IPA templates, which are much discussed in this task, (and used by Nlwiki for their "citation needed" and "when" inline templates), so I thought it might be pertinent to understand how and where it is used, in order to not duplicate any efforts.

and noprint and nomobile are not intended for usage with Popups.

I saw those listed in https://gerrit.wikimedia.org/g/mediawiki/extensions/TextExtracts/+/master/extension.json#52 hence I mentioned.

I saw those listed in https://gerrit.wikimedia.org/g/mediawiki/extensions/TextExtracts/+/master/extension.json#52 hence I mentioned.

Page previews no longer uses TextExtracts so this doesn't apply. I've clarified this in the FAQ. TBH TextExtracts shouldn't be scrubbing those. They are for 2 different purposes - 1) hiding content on mobile and 2) hiding content in print mode. "noexcerpt" is more explicit.

But it is used by Enwiki for IPA templates, which are much discussed in this task, (and used by Nlwiki for their "citation needed" and "when" inline templates), so I thought it might be pertinent to understand how and where it is used, in order to not duplicate any efforts.

I believe "nopopups" is used by Navigation popups to disable showing previews on certain links. We don't support that on page previews, but potentially could as a solution to T169370.

@bearND, @Mholloway: Does the Page Summary API support removal of <$tagName class="noexcerpt"> elements as laid out in T91344#4036824?

@bearND, @Mholloway: Does the Page Summary API support removal of <$tagName class="noexcerpt"> elements as laid out in T91344#4036824?

Yes, .noexcerpt elements (among others) are already stripped in the course of generating a summary.

https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/lib/transformations/summarize.js#L39
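For anyone who wants to see the idea without reading the service code, here is a small Python sketch of class-based stripping. It is not a port of summarize.js, which is JavaScript and does considerably more. The names `ClassStripper` and `strip_classes` are hypothetical. It drops any element whose class list contains one of the given classes, together with everything nested inside it. Comments and doctype declarations are simply discarded in this simplified version.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClassStripper(HTMLParser):
    def __init__(self, remove_classes):
        super().__init__(convert_charrefs=True)
        self.remove_classes = set(remove_classes)
        self.skip_depth = 0  # > 0 while inside a removed element
        self.parts = []

    def _is_removed(self, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return bool(self.remove_classes.intersection(classes))

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
        elif self._is_removed(attrs):
            if tag not in VOID:
                self.skip_depth = 1
        else:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <br/>: emit it only when it is not removed.
        if not self.skip_depth and not self._is_removed(attrs):
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
        else:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(escape(data, quote=False))


def strip_classes(html_text, remove_classes=("noexcerpt",)):
    parser = ClassStripper(remove_classes)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

Because the removal works on markup rather than on punctuation, editors control exactly what disappears by choosing where to apply the class.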

I'm a little confused. This is done, right?

I'm a little confused. This is done, right?

I believe we're still stripping all content within parentheses, (except where those parentheses are connected to another string, e.g. the RAF_Hospital_Wegberg instance is now fixed), so this task hasn't really changed status.

All of the articles I've glance-tested still have the problems discussed above - missing date of birth, missing species' common-name, missing parts of proper title (e.g. for Birdman), etc.

See also T91344#3561474 above, and your reply to that. I don't think anything has changed since then. Here, we're still (ideally) collecting notes on Examples of problems, and Options/Ideas to resolve them.

Jdlrobson renamed this task from Identify specific parenthetical elements to exclude from Hovercards to Review exclude all approach to parenthetical elements. Jul 31 2018, 2:46 AM
Jdlrobson renamed this task from Review exclude all approach to parenthetical elements to Review exclude all approach to parenthetical elements in summary endpoint.
Jdlrobson removed a project: Page-Previews.
Jdlrobson updated the task description.

Parentheses are not stripped from East Asian languages because different characters are used

This has also been fixed. I wrote the code. If not I'd need some examples. If so can we remove this from description?
I've updated the description to make the current state clear.
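For context on the East Asian case: Chinese and Japanese text typically uses fullwidth parentheses （ ） (U+FF08 and U+FF09) instead of the ASCII pair, so a pattern that only knows "(" and ")" never matches. The Python sketch below shows one way to cover both pairs. It is an illustration only, not the fix that was deployed. The function name and the sample sentence are assumptions, and nested groups are not handled.

```python
import re

# ASCII pair, with optional preceding ASCII whitespace.
ASCII_PAREN = re.compile(r"\s*\([^()]*\)")
# Fullwidth pair, with optional preceding ASCII or ideographic (U+3000) space.
FULLWIDTH_PAREN = re.compile(r"[ \u3000]*\uFF08[^\uFF08\uFF09]*\uFF09")


def strip_parentheticals_cjk(text: str) -> str:
    """Strip non-nested parentheticals written with either bracket style."""
    text = ASCII_PAREN.sub("", text)
    return FULLWIDTH_PAREN.sub("", text)


if __name__ == "__main__":
    print(strip_parentheticals_cjk("武則天（624年－705年）是中國皇帝。"))
    # 武則天是中國皇帝。
```

Keeping the two pairs in separate patterns avoids accidentally pairing an ASCII "(" with a fullwidth "）".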

Just noticed that this task is still open, and was wondering if we could continue / finish the conversation that's happening here.
I also see that this task is still linked from the Popups extension wiki page:
https://www.mediawiki.org/wiki/Extension:Popups#Why_are_parenthetical_stripped.3F

Looking at the code (as well as the spec), we are nominally seeking to remove all parentheticals. However, we also have a growing number of rather complex exceptions, which in effect means we're not removing "all" parentheticals, and thus not following the spec. This is continuing to create edge cases where parenthetical content is stripped incorrectly (T226323, T263932), which is requiring ever more complex logic.

For now, for the majority of cases, they are detrimental to the reading experience and we will be removing them. That said, I think it's good to keep this task around for more related conversation. Our hypothesis from this end is that, as content in first paragraphs improves over time (on a very, very long time scale), we will be able to eventually add parentheticals back in.

I just wanted to check whether this is still our position? I wonder if it would make sense to restate our goal in terms of what kind of content we want to strip from summaries, instead of saying "everything except ..., ..., ..., ...". I'm also wondering if the spec should be updated to reflect more accurately what we're actually stripping, and what we're not.

Personally, rather than increasing the complexity of the logic, I think we should be declining any new edge cases in favor of encouraging editors to improve their markup. The noexcerpt class was created for this purpose. I wasn't aware of https://gerrit.wikimedia.org/r/c/mediawiki/services/mobileapps/+/881384/ but in that situation I am not sure why we used a regular expression solution when we could have relied on DOM structure and simply ignored parentheticals inside a bold tag?
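To illustrate the DOM-structure idea, here is a rough Python sketch that strips parentheticals from text nodes only when they are outside a bold element. It is not what the linked Gerrit change does. The class name `BoldAwareStripper` and the sample markup are assumptions, and a parenthetical that spans several text nodes (for example one that contains a link) would not be caught by this simplified version.

```python
import re
from html import escape
from html.parser import HTMLParser

INNERMOST = re.compile(r"\s*\([^()]*\)")
BOLD_TAGS = ("b", "strong")


class BoldAwareStripper(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.bold_depth = 0  # > 0 while inside <b> or <strong>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOLD_TAGS:
            self.bold_depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in BOLD_TAGS and self.bold_depth:
            self.bold_depth -= 1
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.bold_depth == 0:
            # Outside bold text: remove balanced groups, innermost first.
            previous = None
            while previous != data:
                previous, data = data, INNERMOST.sub("", data)
        self.parts.append(escape(data, quote=False))


def strip_outside_bold(html_text: str) -> str:
    parser = BoldAwareStripper()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    lead = ("<p><b>Birdman or (The Unexpected Virtue of Ignorance)</b> "
            "is a film (example parenthetical).</p>")
    print(strip_outside_bold(lead))
```

The bold title keeps its parentheses because the decision is made from the element context, not from the characters around the brackets.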

After reading the comments, I think the problem is with the articles and not with the tool.
First of all, the first lines of an article should be exclusively dedicated to important data. Thus, unimportant data should not be excluded by the tool, but moved within the article.
Secondly, brackets are generally used for secondary data. Putting important data in brackets is a mistake.

New complaint about scientific names getting hidden. Articles on familiar kinds of living things commonly start with something like "Big bush (Bushus biggus) is a big kind of plant"; for biology reasons people want to keep the parenthetical. Discussion at https://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Tree_of_Life#Scientific_name_excised_from_pop-up_preview_-_please%2C_can_it_be_prevented%3F

There used to be an issue where chemical formulae with parentheses got yanked midway, but someone seems to have fixed that. (I'm guessing by requiring word-boundaries in the regex?) Nice.
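As a guess at what such a guard might look like (this is speculation matching the comment above, not the deployed regex), the Python sketch below only strips a group when it is preceded by whitespace and not immediately followed by a word character. Parentheses attached to other text, as in chemical formulae, are then left alone. The function name is hypothetical and nested groups are not handled.

```python
import re

# Strip only "detached" groups: whitespace before "(" and no word
# character directly after ")".
DETACHED = re.compile(r"\s+\([^()]*\)(?!\w)")


def strip_detached_parentheticals(text: str) -> str:
    return DETACHED.sub("", text)


if __name__ == "__main__":
    print(strip_detached_parentheticals("Ca(OH)2 is slaked lime (a base)."))
    # Ca(OH)2 is slaked lime.
    print(strip_detached_parentheticals(
        "Acetone, or (CH3)2CO, is a solvent (widely used)."))
    # Acetone, or (CH3)2CO, is a solvent.
```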


Instead of asking for noexcerpt on a big bunch of articles, how about we add a class that causes the text to always be excerpted, without passing through the stripping mechanism? I am guessing that this would require far fewer cases of special marking.