Page MenuHomePhabricator

Consider excluding pronunciation guides from TextExtracts
Closed, DeclinedPublic

Description

The app clients currently use custom handling to remove pronunciation guides, theoretically this might be good to do upstream once. See parent task for background.

An example of an article that includes such a pronunciation guide (outside of parentheses) is https://en.wikipedia.org/wiki/Abugida.

Related Objects

Event Timeline

JMinor created this task.Apr 28 2017, 6:01 PM

So, we could port this Android code to RESTBase and strip it while generating the summary response. The question is whether there's any client that need this information? If not - the solution is quite straightforward.

Long term reading web team is planning to rethink how we do TextExtracts as we have lots of problems with the existing usages. My guess is that we will want to put more transformations inside the RESTBase endpoint and switch to HTML as the default for all extracts (possibly avoiding using TextExtracts extension altogether). The epic is here - T113094 - and we've started to flesh out requirements.

@Pchelolo @Jdlrobson

So short term what do we want to do here?

If we strip pronunciations fro summaries, this affects page previews. @Jdlrobson will that negatively affect Mobile Web?

If not, lets just strip it

If it will affect page previews then lets do one of the following:

  1. Have the clients to do this client side

OR

  1. Strip it, but version the change

OR

  1. Explicitly mark it up to make it easier for clients to strip

@Jdlrobson will that negatively affect Mobile Web?

This will only impact page previews. Given the issues we've seen with this in MCS I'd appreciate if this was stripped on the client for the time being.

The problem with doing it in MCS is we'd be operating on plain text. The way I'd like to see us go about addressing T113094 is to create a new endpoint that doesn't rely on TextExtracts and makes the extract HTML (given we want to support math). Would the apps want to continue using plaintext as if not we'll need to start thinking about how we ship both or provide both options in most our endpoints!

I think this task is a dupe of T91344 which looks at the bigger problem we have with parentheses.

kaldari added a subscriber: kaldari.May 1 2017, 7:26 PM

The pronunciation templates already output a class called nopopups specifically for the purpose of striping the content from navigation popups. I'm surprised Hovercards isn't already honoring this class. TextExtracts offers a similar class noexerpts, so special casing pronunciation guides shouldn't be necessary. In this case, the community is saying "We want pronunciations in TextExtracts, but we don't want them in Popups." We just need to honor the classes that are already available here, not create new code for handling a special case.

I'm going to boldly edit the title and description to clarify the correct way to solve this. If I'm overstepping, feel free to revert.

kaldari renamed this task from Consider excluding pronunciation guides from TextExtracts to Hovercards (and the apps) should honor the nopopups class within page previews.May 1 2017, 7:34 PM
kaldari updated the task description. (Show Details)
kaldari renamed this task from Hovercards (and the apps) should honor the nopopups class within page previews to Hovercards and the apps should honor the nopopups class within page previews.May 1 2017, 7:34 PM

So the page previews uses the RESTBase summary endpoint which emits the plain-text extract. Potentially we can change that to the HTML extract, but need to verify with all the known clients of the endpoint. Under the hood it uses the TextExtracts extension, but I don't see it emitting the classes in the HTML. @kaldari how do I fetch the html with the nopopup class from the TextExtracts API?

What is the history of this nopopups class? Where is it documented?
Typically use of classes in this way has been very brittle and poorly encapsulated in code.

@Jdlrobson: Use of nopopups dates to the earliest implementation of navigation popups back in 2007. It's basically used for exactly the task described here: preventing pronunciation guides from showing up in popup previews. As far as I know there is no documentation for it, however.

@Pchelolo: I see what you mean. The best approach might be for us to add nopopups to $wgExtractsRemoveClasses, but I'm wondering if this will cause weird artifacts, like "Blah (;;)". If so, maybe my implementation idea isn't feasible after all.

As a passerby, I would like add something odd that I noticed about the content to which the nopopups class is being added. In the Albert Einstein I noticed that the following in the wikitext,

({{IPAc-en|ˈ|aɪ|n|s|t|aɪ|n}};<ref>{{cite book|last=Wells|first=John|authorlink=John C. Wells|title=Longman Pronunciation Dictionary|publisher=Pearson Longman|edition=3rd|date=April 3, 2008|isbn=1-4058-8118-6}}</ref> {{IPA-de|ˈalbɛɐ̯t ˈaɪnʃtaɪn|lang|Albert Einstein german.ogg}};

But in the generated HTML looks something like,

I noticed that the nopopups class is being added for the phonetic text alone. Shouldn't it be added to every content that gets generated as a result of it (the audio file and the word "German" ? Is this done for any reason ?

I noticed that the nopopups class is being added for the phonetic text alone. Shouldn't it be added to every content that gets generated as a result of it (the audio file and the word "German" ? Is this done for any reason ?

Ok, I was partially wrong in my comment. It seems that the nopopups class is being added to to the content generated as a result of the IPAc-en template and not for the content generated as a result of the IPA-de content. Any reasons for this ?

kaldari added a comment.EditedMay 4 2017, 6:58 PM

@Kaartic: I think navigation popups also strips everything in <small> tags, so that takes care of the audio file link and the "German:" label. Here's a screenshot of how it renders in navigation popups:

It does output "( ; ; " as I feared, so adding nopopups to $wgExtractsRemoveClasses probably wouldn't be a good solution. The ideal solution, IMO, would be to have HoverCards and the apps strip out everything inside parenthetical phrases and also anything with class nopopups or noexerpts or inside <small> tags, but it seems there is no clear path to doing that. I think I'm going to change this task back to its original title and description so that it can move forward.

kaldari renamed this task from Hovercards and the apps should honor the nopopups class within page previews to Consider excluding pronunciation guides from TextExtracts.May 4 2017, 7:00 PM
kaldari updated the task description. (Show Details)

So one possible solution on the TextExtracts side would be to add a new mode (controlled by an API parameter) that strips all content delimited by parentheses, brackets, or slashes.

@Kaartic: I think navigation popups also strips everything in <small> tags, so that takes care of the audio file link and the "German:" label.

Ignoring TextExtract and speaking generally, shouldn't the contents generated by the IPA / IPAc template be tagged with the IPA class and shouldn't the IPA text itself be tagged with something more specific (IPA-text) ?

This may help the implementations, that don't use TextExtract, to easily identify the IPA and related text. I guess the TextExtract implementation itself would be simplified as a result of this.

Mhurd added a comment.May 10 2017, 9:45 PM

@kaldari

So one possible solution on the TextExtracts side would be to add a new mode (controlled by an API parameter) that strips all content delimited by parentheses, brackets, or slashes.

^ that would be great! (we're having to strip "(...)", "[...]" and "\...\" manually in the iOS app)

@Kaartic: I think navigation popups also strips everything in <small> tags, so that takes care of the audio file link and the "German:" label. Here's a screenshot of how it renders in navigation popups:


It does output "( ; ; " as I feared, so adding nopopups to $wgExtractsRemoveClasses probably wouldn't be a good solution. The ideal solution, IMO, would be to have HoverCards and the apps strip out everything inside parenthetical phrases and also anything with class nopopups or noexerpts or inside <small> tags, but it seems there is no clear path to doing that. I think I'm going to change this task back to its original title and description so that it can move forward.

By default, Navpopups will not show any templates at all. It also doesn't parse templates. I believe this is primarily to fix:

  • Hatnotes
  • Maintenance banners
  • (Infoboxes/taxoboes are already excluded via Previewmaker.prototype.killBoxTemplates = function ())

I believe editors who use navpopups would generally be happy if they had an option to show parsed templates in the lead sentences, whilst still excluding the 2 items above. This probably just means a coordinated effort to standardize the CSS classes everywhere needed?

If you change the setting to show templates, it looks something like this (I have other tweaks in mine):

Jdlrobson moved this task from Backlog to Watching on the Readers-Web-Backlog (Tracking) board.
Jdlrobson closed this task as Declined.Jul 13 2017, 6:55 PM
Jdlrobson added a subscriber: phuedx.

@phuedx and I have agreed that TextExtracts is not the right place to do it. It should be kept as dumb as possible to cater for all possible use cases and doing this there is not as easy as in RESTBase. For the purpose of summaries we will create a new endpoint.

Thus, we will not be doing this in the TextExtracts API, but you are more than welcome to debate it as being a candidate for inclusion of the new REST API endpoint we plan to build (T113094)