
As a user, I'd like lead sentences to be concise so I can get an overview of the topic I'm reading about.
Closed, Resolved | Public | 2 Estimated Story Points

Description

Strip all information in parentheses from the first sentence of Wikipedia articles in the app.

Initially, put this behind a feature flag so that it's available in all app flavours except production, and for English Wikipedia only.
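A minimal sketch of the gating described above, written in Python purely for illustration (the app itself is not written in Python). The `BuildInfo` type, the flavour string `"prod"` and the function name are assumptions, not names from the app's code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildInfo:
    """Hypothetical description of the running build."""
    flavor: str  # assumed values: "alpha", "beta", "prod", ...
    wiki: str    # wiki database name, e.g. "enwiki"


def strip_parentheticals_enabled(build: BuildInfo) -> bool:
    """Feature flag: on for every flavour except production, enwiki only."""
    return build.flavor != "prod" and build.wiki == "enwiki"


print(strip_parentheticals_enabled(BuildInfo("beta", "enwiki")))  # True
print(strip_parentheticals_enabled(BuildInfo("prod", "enwiki")))  # False
```

Keeping the check in one function means the flag can later be widened to other wikis or flavours in a single place.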

Design: https://trello.com/c/jMlYFadF/112-improve-the-readability-of-the-first-sentence

Event Timeline

Deskana raised the priority of this task from to Medium.
Deskana updated the task description.
Deskana moved this task to To Do on the Mobile-App-Sprint-53-Android board.
Deskana subscribed.

I don't understand this design. Are we no longer having the lead image at the top with title description overlay?

@Mhurd We're going to be meeting at 11am next Monday to talk through this more fully.

@Deskana Vibha already set me straight - I thought the mock was an actual mock for presenting the lead sentence to the left of the image - it was just a screenshot of a hover card to show a cleaner lead sentence. :)

How will a reader, in addition to seeing the transformed lead section, be able to see the untransformed text?

@bearND Exactly. :-)

We'll discuss this, and other problems, on Monday.

Change 197945 had a related patch set uploaded (by Dbrant):
Remove parenthetical information from lead sentences.

https://gerrit.wikimedia.org/r/197945

Change 197945 merged by jenkins-bot:
Remove parenthetical information from lead sentences.

https://gerrit.wikimedia.org/r/197945

@bearND was there an issue with Code review?
I'm not seeing this in the current alpha build and @KLans_WMF has moved this to the current sprint?
Could you please clarify?

@Vibhabamba It's working for me. What article did you try it on?

@Vibhabamba It's feature flagged to only appear on enwiki and for alpha and beta builds. It would be good to mention some good example pages in the Phab tasks and/or Git commit message in the future. I don't remember a good example right now.

Good articles to test:
Bern
Genghis Khan
Epistemology
Hegemony
Dara Shikoh
Jahanara Begum
Les Misérables (musical)
Marseille

@Deskana @bearND @Dbrant.
I see the changes on alpha, not on beta, that was confusing me.
I suppose beta will update soon?

bearND renamed this task from "As a user, I'd like lead sentences to be consistent so I can get an overview of the topic I'm reading about." to "As a user, I'd like lead sentences to be concise so I can get an overview of the topic I'm reading about." (Apr 3 2015, 5:57 PM)
bearND removed a project: Patch-For-Review.

When we enable link previews should we still do this?

Should we just sanitize the preview rather than the article?

I just brought my previous point up on the mailing list thread as well.

I think if we restrict this sanitizing to article previews, we will get a lot less blowback from the community.

I also think that, UX-wise, this makes sense since previews are what people will see first - we make that experience beautiful, and then when they drop into the article they can see the canonical information they expect.

cc @dr0ptp4kt @BGerstle-WMF

The information in brackets is useful. Nobody is disputing that.
What is not useful is its current format.

For a better way of doing this - take a look at this link:
http://dictionary.cambridge.org/us/pronunciation/british/retina

Screenshot attached.

Screen_Shot_2015-04-09_at_4.29.29_PM.png (190×494 px, 27 KB)

We must do the right thing for all users, and not shy away from it because we are scared of anger.
The next step is to build a more structured panel for this information.

Also - responding to this with a focus on all user types is important.
It will set up the precedent for many future discussions.

I am trying to figure out what this is doing exactly. Are there any unit tests for it?
(I did not find anything in the Gerrit changeset nor with a GitHub search).

Has there been any kind of large-scale testing of what this yields?

  • a comparison before/after on a reasonably large dataset? A measure of how much and what is actually removed (dates? IPA? other?)
  • I understand from the (partial) email thread that the purpose is to increase readability as proxied by the HemingwayApp; has there been any large-scale measurement showing that this actually leads to a better HemingwayApp score (or, if that is closed source, Flesch–Kincaid or another), and if so to what extent? (FWIW I suppose it does improve the score, but I would tend to assume that any removal, by definition, leads to improved scoring.)

The Android Beta app omits parenthesised text from the first paragraph when viewing articles. As a user, that feels like a bug. It defaces content and can make information written by Wikipedia editors inaccessible to readers.

Doing this for excerpts in hover card previews and search results makes perfect sense. Maybe stripping parenthesised sentence parts is too generic, but that's something we'll learn with time. It'd be interesting to explore a more explicit blacklist (e.g. only known objects like IPA templates; or some opt-in/opt-out html classname).

Changing content isn't like restyling templates or collapsing article notices. It leaves no way for readers or editors to understand or mitigate it. Below are a few examples I found after a few minutes of research with people on IRC.

Content bugs:

  • Breaks links.
    • John Smith (comics): Breaks the "2000 AD" and "Crisis" links. These now point to unrelated articles.
    • The First Conspiracy: The link to "The (International) Noise Conspiracy" (now "The Noise Conspiracy") becomes a 404 Not Found. And since this is after redlink removal, clicking it cascades into other bugs related to accessing articles that don't exist.
  • Breaks titles.
  • Breaks language grammar.
    • It almost always leaves a spurious space before a full-stop or comma.
    • (a,b)-tree: produces "an tree" where "a tree" would be correct.
  • Hides information. There's plenty of relevant information written in parentheses that is not typically in an infobox. Not to mention that many articles don't have infoboxes. And infoboxes can be very long, and collapsed by default. Authors add things to content when they have more significance. E.g. on https://en.wikipedia.org/wiki/Carbon_dioxide it mentions "CO2" in parentheses. The infobox also contains it, somewhere deep inside under some subsection, but that's not where one would look when unfamiliar with the subject.
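To make the link and grammar bugs above concrete, here is a small Python sketch of a naive stripper that treats serialised HTML as plain text. It is not the app's actual code; the regex, the function name and the example `href` are assumptions chosen only to reproduce the kind of damage described.

```python
import re

# Matches one innermost "(...)" group. Illustrative assumption, not the app's regex.
PAREN_GROUP = re.compile(r"\([^()]*\)")


def naive_strip(serialized_html: str) -> str:
    """Delete parenthesised runs anywhere in the string, attributes included."""
    previous = None
    while previous != serialized_html:  # repeat so nested groups unwind
        previous = serialized_html
        serialized_html = PAREN_GROUP.sub("", serialized_html)
    return serialized_html


# Spurious space left before the comma:
print(naive_strip("John Smith (born 1967), a writer"))
# -> John Smith , a writer

# The link target is rewritten because the href also contains parentheses
# (hypothetical article title):
print(naive_strip('<a href="/wiki/Example_(comics)">Example</a>'))
# -> <a href="/wiki/Example_">Example</a>
```

Both outputs follow directly from the string being processed without any notion of where text ends and markup begins.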

Technical bugs:

  • Breaks attributes used by MediaWiki core, extensions or gadgets. These values may not even be human-readable text content but use parentheses for other purposes. Perhaps unlikely in the first paragraph, but worth considering.
  • Breaks DOM integrity. It uses innerHTML, which is not a content property but a serialisation artefact (innerHTML is a misnamed hybrid of toHTML, parseHTML, and replaceChild). Using that results in recursive serialisation to an HTML representation. This string is then manipulated as if it were plain text, and re-parsed to replace the paragraph content. Any event handlers, or other DOM references are now invalidated and lose functionality (while retaining memory consumption for both). Acting on text nodes (not elements) would fix most of these issues.
  • Affects performance. The serialising, re-parsing and extra memory use are unnecessary, though given it's restricted to one paragraph, not major.
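The suggestion to act on text nodes rather than on serialised markup can be sketched as follows. This is a Python stand-in using the standard library's `html.parser`, not DOM code from the app; the class and function names are hypothetical. Tags are re-emitted verbatim, so attributes such as `href` are never touched, and only character data is filtered.

```python
from html import escape
from html.parser import HTMLParser


class TextNodeParenStripper(HTMLParser):
    """Removes parenthesised text from text nodes only; tags pass through.

    Simplification: comments and doctype declarations are dropped.
    """

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.depth = 0  # open parentheses, carried across text nodes

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())  # original tag, verbatim

    def handle_startendtag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        kept: list[str] = []
        for ch in data:
            if ch == "(":
                if self.depth == 0:
                    # Drop whitespace directly before the parenthetical.
                    while kept and kept[-1].isspace():
                        kept.pop()
                self.depth += 1
            elif ch == ")" and self.depth > 0:
                self.depth -= 1
            elif self.depth == 0:
                kept.append(ch)
        self.parts.append(escape("".join(kept), quote=False))


def strip_parentheticals(html: str) -> str:
    parser = TextNodeParenStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(strip_parentheticals(
    '<p><b>Crisis</b> (a comic) ran in '
    '<a href="/wiki/2000_AD_(comics)">2000 AD</a>.</p>'
))
# -> <p><b>Crisis</b> ran in <a href="/wiki/2000_AD_(comics)">2000 AD</a>.</p>
```

One known gap in this sketch: an element that sits wholly inside a parenthetical keeps its tags and loses its text, leaving an empty element behind that a real implementation would want to prune. It also does nothing about the "hides information" concern, which is a product question rather than a technical one.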

Upping this. Has this been taken into account?

Last I heard, the team wasn't planning on pushing this to production any time soon, so there wasn't a lot of urgency around fixing issues with it. Debate about whether this is a good idea or not aside, there are quite clearly a few bugs mentioned in this discussion which would need to be fixed before this went to production.

That said, I'm not on the Mobile Apps Team any more, so treat my comments accordingly. :-)

On the iOS side, we are exploring other designs now to add additional panels of information (like previews), separate from the article content itself.

While user testing is ongoing, my inclination is to only attempt to sanitize text (actually use the text snippets API) within those additional panels - while leaving the article content unmodified so users will always have the canonical article available to read, even while on a mobile device.

Separate from this discussion is the point that @Vibhabamba was making… we should also look for ways to make article content "smarter" (in lieu of removing it) - like adding pronunciation buttons within the article. This way, all the same information is intact, only it is presented in a way that is more usable.

As an editor and reader, this suppression of bracketed information (typically a person's birth and death dates) feels like a bug. I spent time trying to find a person's birth date to improve the article from which it seemed to be missing, went to edit it, and found it was there all along.

If we believe this information is important for readers, which is why it is part of the standard lead sentence, why should it be hidden from mobile readers?

Please at least give an option so that "power users" can opt to have this information not suppressed.

See also T102850: Parenthesis stripped from link, causing link target to change (one of the bugs pointed out by @Krinkle above).

Considering the numerous problems listed here, I guess it's unlikely that this feature will be activated in production soon. But I understand it is now used in link preview in some form, where it makes more sense.

I like the idea of an approach focusing more narrowly on the IPA template(s) - which seems to have been the main problem that motivated the creation of this feature. (Even though this would be pretty enwiki-centric initially.)

Another example of an article where removing bracketed text makes a nonsense: https://en.wikipedia.org/wiki/Three_Men_in_a_Boat . Please reconsider this "feature".

I think this experiment was worth trying, but it hasn't really worked out. It introduced a lot of edge cases which could be worked around, but the reward doesn't seem worth the effort. Ultimately, the decision rests with the product owner, @Dbrant.

@PamD Indeed, this was an experiment that we tried in the Beta app (never released to production), and since the algorithm that strips the parenthetical content is relatively naïve, it was doing more harm than good by corrupting link anchors. Until we develop a better algorithm, we've reverted this change (it will be updated in the next Beta release).

Thanks, that's good news. Time and again I've been gnoming away stub-sorting etc on my mobile, thought the dates were missing from lead though present in text or infobox, gone to edit the lead to add them ... and found them sitting there OK, just hidden by the app.