Currently we scrub all parentheticals in the page summary endpoint. This task is about reconsidering and changing that approach (a good summary of the current state is at https://phabricator.wikimedia.org/T91344#4462979).
Identify specific parenthetical elements to exclude from Hovercards, e.g. class="IPA" for pronunciation templates, to let Hovercards be more specific in its content filtering.
This is a followup to the decision to remove all parenthetical content. That was originally done because a few types of content often take up many characters in the first sentence, which made the hovercard/extract much less useful. Those types are:
- pronunciation - e.g. https://en.wikipedia.org/wiki/Germany - on enwiki the template family already uses class="IPA"
- etymology - e.g. https://en.wikipedia.org/wiki/Epistemology - no standard class on enwiki
- birth/death dates & location - e.g. https://en.wikipedia.org/wiki/William_Shakespeare - no standard class on enwiki
- (other?)
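As a rough illustration of what class-based filtering could look like, here is a minimal Python sketch. It is not the summary endpoint's actual code. The names `EXCLUDED_CLASSES`, `ClassFilter` and `extract_text` are hypothetical, the input is assumed to be well-formed HTML, and only class="IPA" is taken from this task. It drops elements carrying an excluded class, keeps all other parenthetical text, and tidies up parentheses that end up empty.

```python
import re
from html.parser import HTMLParser

# Hypothetical exclusion list; "IPA" is the only class named in this task.
EXCLUDED_CLASSES = {"IPA"}

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassFilter(HTMLParser):
    """Collects text, skipping any element that has an excluded class."""

    def __init__(self, excluded):
        super().__init__(convert_charrefs=True)
        self.excluded = set(excluded)
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth += 1  # nested element inside an excluded one
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.excluded.intersection(classes):
            self.skip_depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)


def extract_text(html, excluded=EXCLUDED_CLASSES):
    parser = ClassFilter(excluded)
    parser.feed(html)
    parser.close()
    text = "".join(parser.out)
    # Remove separators left dangling next to a parenthesis.
    text = re.sub(r"\(\s*[;,]?\s*", "(", text)
    text = re.sub(r"\s*[;,]?\s*\)", ")", text)
    # Drop parentheses that are now empty, plus the space before them.
    text = re.sub(r"\s*\(\)", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

With this approach a parenthetical that contains only a pronunciation disappears entirely, while one that mixes a pronunciation with other content keeps the remainder. Etymology and birth/death dates would still need a class (or some other marker) before they could be filtered the same way.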
The removal of parenthetical content has been discussed in a few topics, such as:
- https://www.mediawiki.org/wiki/Topic:Scdhcjurfi3fe96o
- https://www.mediawiki.org/wiki/Topic:S6q53exrf5ir908w
- https://www.mediawiki.org/wiki/Topic:S789qh0m321280v5
- https://www.mediawiki.org/wiki/Topic:Rvrf5mi32feehqf3
Parenthetical content was originally removed in T67138: Hovercards: Fix the bracket removal code, with a followup in T69225: Hovercards: Space before removed parentheses should also be removed in the extract.
Note: parentheses are not stripped from East Asian languages because different characters are used (https://www.mediawiki.org/wiki/Topic:Tn0k9fjn0g8e9fpe).
We'd like to do this more precisely and judiciously, but without negatively affecting other re-users of extracts. That means we can't simply re-use class="noexcerpt" (e.g. template:IPAc-en) or class="nopopups" (e.g. template:H:IPA) without deeply understanding what those classes already cover.
Can anyone help us map out the existing uses of class="noexcerpt" and class="nopopups"?
Or suggest other ideas?
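One possible starting point for the mapping question above, offered as a sketch only: build a MediaWiki Action API search query that uses the CirrusSearch `insource:` keyword to list Template-namespace pages whose wikitext mentions a given class. The helper name `build_class_search_url` and its default values are hypothetical. The function only constructs the URL and makes no request.

```python
from urllib.parse import urlencode

TEMPLATE_NAMESPACE = 10  # "Template:" namespace ID in MediaWiki


def build_class_search_url(css_class,
                           api="https://en.wikipedia.org/w/api.php",
                           limit=50):
    """Return an Action API URL searching template wikitext for css_class."""
    params = {
        "action": "query",
        "list": "search",
        # insource: matches the raw wikitext rather than the rendered page.
        "srsearch": f'insource:"{css_class}"',
        "srnamespace": TEMPLATE_NAMESPACE,
        "srlimit": limit,
        "format": "json",
    }
    return f"{api}?{urlencode(params)}"


if __name__ == "__main__":
    for name in ("noexcerpt", "nopopups"):
        print(build_class_search_url(name))
```

A search like this would only find templates that mention the class literally in their own wikitext. Classes added through Lua modules or inherited from nested templates would need a separate pass, and the list of templates still has to be followed by a look at where each one is transcluded.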