
Spike (3h): How to do proper sentence detection for TextExtracts
Closed, ResolvedPublic

Description

How do we do proper sentence detection in TextExtracts?

Find out a proper technical solution and outline a plan for implementation if possible, or a plan for sunsetting it if not.

See blocked task for details about the problems.

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · May 11 2016, 5:24 PM
Jhernandez updated the task description. · May 12 2016, 8:38 AM
Jhernandez added subscribers: Nirzar, JKatzWMF, dr0ptp4kt.

As per T115817#2286302 it seems like this is not going to be very promising.

If we can't find a good technical implementation, we may have to sunset using exsentences in our extensions, given all the problems it is causing.

An alternative could be fetching a fixed number of characters and showing an ellipsis after them. cc/ @Nirzar @JKatzWMF @dr0ptp4kt Something to think about.
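As a rough illustration only (the endpoint, title, and 300-character limit here are arbitrary, not a proposal), the character-based alternative is just a TextExtracts query with exchars instead of exsentences:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_extract_by_chars(title, chars=300):
    """Fetch a plain-text extract truncated to roughly `chars` characters."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": title,
        "exchars": chars,      # character-based truncation instead of exsentences
        "explaintext": 1,      # plain text rather than HTML
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    extract = next(iter(pages.values())).get("extract", "")
    # TextExtracts normally appends "..." itself when it truncates; add one defensively otherwise.
    return extract if extract.endswith(("...", "…")) else extract + "…"
```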

This seems to be solved, at least for English, outside of PHP (http://nlp.stanford.edu/courses/cs224n/2003/fp/huangy/final_project.doc, https://phabricator.wikimedia.org/T59669#1403468).

The issue here is that these kinds of algorithms are computationally expensive and would need to be run as background jobs that store their results somewhere the TextExtracts extension can consult (roughly as sketched below). This seems like a lot of work.
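Purely as a sketch of that background-job shape (the table, job, and segmentation function here are all hypothetical, not an existing design):

```python
import sqlite3  # stand-in for whatever store the extension would actually consult

def segment_sentences(text):
    """Placeholder for a real (expensive) sentence-detection algorithm."""
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def precompute_extract(conn, page_id, plain_text, max_sentences=2):
    """Run segmentation offline (e.g. in a job queue) and cache the result per page."""
    extract = " ".join(segment_sentences(plain_text)[:max_sentences])
    conn.execute(
        "INSERT OR REPLACE INTO page_extracts (page_id, extract) VALUES (?, ?)",
        (page_id, extract),
    )
    conn.commit()

conn = sqlite3.connect("extracts.db")
conn.execute("CREATE TABLE IF NOT EXISTS page_extracts (page_id INTEGER PRIMARY KEY, extract TEXT)")
precompute_extract(conn, 42, "First sentence. Second sentence. Third sentence.")
```

TextExtracts (or any consumer) would then read page_extracts instead of segmenting at request time.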

We might want to consider allowing editors to mark up text extracts by providing start and end markers. This also seems like a lot of work.

Thus I would recommend we avoid using exsentences and use chars, but I guess that's up to the people you pinged above. Maybe they think fixing this is worth the work?

On apps, we do what @Jhernandez mentioned. We use 4 lines of text with an ellipsis at the end, which seems to be a better solution than doing sentence detection, given the language constraints. We haven't heard any complaints about this from readers so far.

Looking at Hovercards on beta, we can do 8 lines with a picture and 6 lines without one.

Sidenote: The intention of Hovercards is to give an idea of what the topic is: in most cases, knowing whether it's a person, place, or event, plus some short context, not actually learning deeply about the topic.

phuedx added a subscriber: phuedx. · May 13 2016, 8:37 AM

> Sidenote: The intention of Hovercards is to give an idea of what the topic is: in most cases, knowing whether it's a person, place, or event, plus some short context, not actually learning deeply about the topic.

IIRC, in MediaWiki-extensions-Cards, we request the Wikidata description and the text extract, and only render the latter if the former isn't available.

@Jhernandez can you sign off and create/update the card?

@Jdlrobson, @Mholloway, @bearND, and I were discussing this and I asked if we could add some links here to aid in identifying potentially common logic. I'm interested in getting the treatment for cards consistent, both with a short term fix (e.g., just use the parts that are known to work already) and a medium term fix (get to one consistent treatment on the summary endpoint).

@Mholloway and @bearND, would you please confirm the Android client side treatment of extracts?

The iOS preview fetching code is below. @Mhurd, can you remind us what it is in the iOS app that spruces up the text extract so it looks nice in cards in the feed? I gather that the ellipsis is Cocoa-provided, but I seem to recall the text being cleaned up with methods (I think it's summary?) such as those in https://github.com/wikimedia/wikipedia-ios/blob/da2b7fadcb39356c5535eddd8926a2f9990a30de/Wikipedia/Code/NSString%2BWMFHTMLParsing.m.

https://github.com/wikimedia/wikipedia-ios/blob/da2b7fadcb39356c5535eddd8926a2f9990a30de/Wikipedia/Code/NSDictionary%2BWMFCommonParams.m

https://github.com/wikimedia/wikipedia-ios/blob/39ac602194a37b9c23f3fddfccb92e4d52376e99/Wikipedia/Code/WMFArticlePreviewFetcher.m

Some history on the summary RESTbase endpoint that's used by newer Android clients is in https://phabricator.wikimedia.org/T117082, and the server config for that is expressed within this RESTbase endpoint: https://github.com/gwicke/restbase/tree/c0fb3f343b236dbe30c22eaff2c5b7369eeda32a/wikimedia/v1. CC @GWicke for more context.

If we're really interested in a service that provides exsentences-like functionality, then we might want to investigate some of the NLP packages available on npm – nlp_compromise looks interesting – with a view to creating a new, specialised RESTBase endpoint.

Summary for now

  • Sentence detection is completely broken and a blocker for hovercards stable
  • There are no good & fast solutions for use in PHP land
  • Uses of TextExtracts:
    • iOS uses exchars
    • Mobile content service uses exsentences
    • We use exsentences in RelatedArticles
  • In Hovercards, Design is OK with using char limits, as they are doing on iOS
  • There may be some solutions for a Node service, but since we have to support a PHP service and there are no alternatives there, it doesn't seem worth investigating right now

Correct me if I'm wrong on any of the above.


Temporary follow-up plan

  1. Replace Hovercards' usage of exsentences with exchars and fix the UI (to enable the Hovercards rollout to stable)
  2. Deprecate exsentences in TextExtracts api module
  3. Replace usage of exsentences in RelatedArticles and the RESTBase summary endpoint
  4. Grep for other usages in Wikimedia repos and transition them
  5. Email mediawiki-api-announce

I'm moving this to sign-off and keeping it there until I receive some comments, or until the end of the sprint is close and I create the follow-up plan tasks.

> iOS uses exchars

Could @Nirzar/@Mhurd summarise how the app generates an extract using the exchars parameter – as much as I like reading through code… ;)

On Android, we request five sentences via exsentences as our raw content, either directly from the MW API[1] (for legacy/fallback API content loading) or via RESTBase's summary endpoint[2], then feed the result into Java's BreakIterator[3] class along with the Locale in order to parse two localized sentences that we ultimately include in the link preview.[4]

I'm having trouble putting my finger on a Phab task discussing this but my understanding is that @Dbrant did some testing and found that our additional client-side processing with BreakIterator produced better results than relying on exsentences alone.

[1] https://github.com/wikimedia/apps-android-wikipedia/blob/master/app/src/main/java/org/wikipedia/server/mwapi/MwPageService.java#L174-L176
[2] https://github.com/wikimedia/restbase/blob/04d442f34e77566e5abdc75ed35c798518b4dd08/v1/summary.yaml#L101
[3] https://developer.android.com/reference/java/text/BreakIterator.html
[4] https://github.com/wikimedia/apps-android-wikipedia/blob/master/app/src/main/java/org/wikipedia/page/linkpreview/LinkPreviewContents.java#L89-L116
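For reference, here is a rough Python approximation of that client-side step, using PyICU (Android's BreakIterator is ICU-backed, so the boundary behaviour is comparable; the function name and the two-sentence count are just taken from the description above):

```python
from icu import BreakIterator, Locale  # PyICU; roughly mirrors java.text.BreakIterator

def first_sentences(text, count=2, locale="en"):
    """Take the first `count` locale-aware sentences from a blob of extract text."""
    bi = BreakIterator.createSentenceInstance(Locale(locale))
    bi.setText(text)
    pieces, start = [], bi.first()
    for end in bi:  # iterating yields successive sentence-boundary offsets
        pieces.append(text[start:end])
        start = end
        if len(pieces) == count:
            break
    return "".join(pieces).strip()

print(first_sentences("Dr. Smith lives in St. Louis. He works at a hospital. He is 42.", 2))
```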

Taking a step back: since on Android we're really just requesting a blob of text that's hopefully big enough for us to parse two sentences from on the client side, we could just as easily use exchars as exsentences if we decide that we want to sunset the latter.

Of course we'd also be happy to move sentence parsing logic off the client altogether if we can get sentence parsing of sufficient quality on the server side.

> If we're really interested in a service that provides exsentences-like functionality, then we might want to investigate some of the NLP packages available on npm – nlp_compromise looks interesting – with a view to creating a new, specialised RESTBase endpoint.

Yes, this is a prime RESTBase use case if the need for exsentences-like functionality is strong enough to justify the development/maintenance effort.

If we're not tied to PHP/JS, https://github.com/nltk/nltk seems to be the most sophisticated set of tools for this kind of thing.
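To make that concrete, a minimal sketch of the kind of helper such an endpoint could wrap, using NLTK's Punkt sentence tokenizer (the function name and the two-sentence default are assumptions for illustration):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # one-time download of the Punkt sentence models

def extract_sentences(plain_text, count=2, language="english"):
    """Return the first `count` sentences of a plain-text extract."""
    return " ".join(sent_tokenize(plain_text, language=language)[:count])

print(extract_sentences("TextExtracts is a MediaWiki extension. It returns extracts. It has options."))
```

Punkt only ships models for a limited set of (mostly European) languages, so language coverage would still be a concern.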

Thanks for chiming in @Mholloway.

@dr0ptp4kt seems like moving all uses of exsentences to exchars and deprecating the param is a sensible option for all parties (Android + Web extensions). Thoughts product wise? I'll need to create a bunch of tasks to proceed with this properly (see implementation proposal in T135020#2300556).

@Jhernandez, the proposal makes sense, although I recommend you email mediawiki-api early. One question: @Mholloway, will Android Just Work (TM) once RESTbase is reconfigured to use exchars instead of exsentences?

@Jhernandez note iOS requests 525 characters https://github.com/wikimedia/wikipedia-ios/blob/39ac602194a37b9c23f3fddfccb92e4d52376e99/Wikipedia/Code/WMFNumberOfExtractCharacters.h.

Mholloway added a comment (edited). · May 17 2016, 6:05 PM

> @Mholloway, will Android Just Work (TM) once RESTbase is reconfigured to use exchars instead of exsentences?

In principle, it Should (TM). However, I was just digging around in older versions of our repo on Github, and apparently we did use exchars at one point in time, as I vaguely remembered yesterday, but:

> Also, we no longer request "excharacters" from TextExtracts, since it has an issue where it's liable to return content that lies beyond the lead section, which might include unparsed wikitext, which we certainly don't want.

I'd be interested in hearing whether iOS has run into this and whether it's fixed or workaround-able.

Edit: related task: T101153: Link preview sometimes shows wikitext markup and "Edit"

Adding @dchan @KartikMistry @santhosh, as they have done work on sentence segmentation for ContentTranslation-CXserver: https://github.com/wikimedia/mediawiki-services-cxserver/tree/master/segmentation.

There could be opportunities for sharing expertise and maybe even code.

ContentTranslation-CXserver has sentence segmentation; it can segment DOM elements while producing well-balanced tags. Documentation is available at https://www.mediawiki.org/wiki/Content_translation/Developers/Segmentation

We have support for several languages, but languages like Thai are difficult. CX is designed so that it can tolerate cases where we cannot segment a sentence (falling back to the parent block element).

I am not familiar with the use case you want to address, but to handle all languages it is always better to use a simple segmentation algorithm like the one cxserver uses, along with a safe fallback.
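To make the fallback idea concrete, a much-simplified sketch (this is not cxserver's actual code, which works on DOM nodes and keeps tags balanced; names here are illustrative):

```python
def segment_block(block_text, language, segmenters):
    """Segment one block element's text; fall back to the whole block if we cannot.

    `segmenters` maps language codes to sentence-splitting functions; languages
    without an entry (or where splitting fails) get the parent block back as one segment.
    """
    segmenter = segmenters.get(language)
    if segmenter is None:
        return [block_text]  # safe fallback: treat the block as a single segment
    try:
        return segmenter(block_text)
    except Exception:
        return [block_text]

# Example: English has a naive splitter registered, Thai does not.
segmenters = {"en": lambda t: [s.strip() + "." for s in t.split(".") if s.strip()]}
print(segment_block("One. Two. Three.", "en", segmenters))
print(segment_block("ประโยคภาษาไทย", "th", segmenters))
```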

@dr0ptp4kt @Jhernandez Upon further investigation it turns out that the T101153 problem can be worked around by setting the exintro param, which is what the iOS app is doing. So, yes, exchars should Just Work for Android.
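In request terms, the workaround just means adding exintro to the character-based query, something like the following (values are illustrative; 525 matches the iOS constant linked above):

```python
# Character-limited extract restricted to the lead section, so content beyond the
# lead (which may contain unparsed wikitext, per T101153) is never returned.
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "titles": "Example article",
    "exchars": 525,    # character limit, as on iOS
    "exintro": 1,      # lead section only: the T101153 workaround
    "explaintext": 1,
}
```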

> Sentence detection is completely broken and a blocker for hovercards stable

@Jhernandez: I don't see any evidence for this. There's a single blocking task (T115817) about a problem with lang templates on Russian Wikipedia. How does that constitute "completely broken"? I found 2 other bugs related to exsentences (T100845 and T59669), but both of them seem to actually be fixed now.

> Deprecate exsentences in TextExtracts api module

@Jhernandez: This seems ridiculous and dangerous to me. Ridiculous because I don't see any evidence that it is substantially broken. Dangerous because there are a lot of users of exsentences outside of MediaWiki that will be broken by this. Just because it doesn't work well enough for Hovercards doesn't mean it should be killed for all other uses. The perfect should not be the enemy of the good.

@GWicke @santhosh thanks for chiming in, it seems like an interesting thing to use in the restbase summary endpoint.


@Mholloway thanks for the note, I'll keep it in mind when making the migration task.


@kaldari Fair enough, there's no need to deprecate the param, only for the team's projects to stop using it. That makes sense.

Regarding blocking rollout, that's my understanding from the product owners driving the promotion and from what they have discussed with users of the initial target wikis. @JKatzWMF and @dr0ptp4kt should be able to provide more details.


TODO next:

  1. Replace Hovercards' usage of exsentences with exchars and fix the UI (to enable the Hovercards rollout to stable) (use exintro with exchars to avoid wikitext in the output; iOS uses 525 chars)
  2. Replace usage of exsentences in RelatedArticles

TODO soon:

  1. Create tasks for replacing exsentences with exchars and potentially using cxserver's sentence detection in the RESTBase summary endpoint, and discuss with the Android team and Services.
  2. Grep for other usages in reading-web-maintained repos and propose a transition to exchars where and if it makes sense (with tasks).

Does this sound good? Anything else to take into account?

I'm planning to create follow-up tasks and resolve this one at the end of tomorrow, Friday (Europe end of day), to match the end of the sprint, if nothing urgent or blocking comes up.

Yeah, #1 is the blocker for Russian Wikipedia Hovercards, which we'd want to A/B test before rolling out full force. Russian Wikipedia is next in line after Hungarian Wikipedia.