How do we do proper sentence detection in TextExtracts?
Find out a proper technical solution and outline a plan for implementation if possible, or a plan for sunsetting it if not.
See blocked task for details about the problems.
| Status | Assigned | Task |
|---|---|---|
| Resolved | Dereckson | T68374 Enable Hovercards on se.wikimedia.org (Swedish chapter wiki) |
| Resolved | Jdlrobson | T70860 [GOAL] Graduate Page Previews feature (Popups extension) out of Beta Feature |
| Resolved | Jdlrobson | T132602 [GOAL] Roll Hovercards out on smaller wikipedia project |
| Resolved | putnik | T115817 Template "lang" badly processed |
| Resolved | Jhernandez | T135020 Spike (3h): How to do proper sentence detection for TextExtracts |
As per T115817#2286302 it seems like this is not going to be very promising.
If we can't find out a good technical implementation, we may have to sunset using exsentences in our extensions given all the problems they are causing.
An alternative could be fetching a fixed number of characters and showing an ellipsis after it. cc/ @Nirzar @JKatzWMF @dr0ptp4kt Something to think about.
This seems to be solved, at least for English, without PHP (http://nlp.stanford.edu/courses/cs224n/2003/fp/huangy/final_project.doc, https://phabricator.wikimedia.org/T59669#1403468).
The issue here is that these kinds of algorithms are computationally expensive and would need to run as background jobs that store results somewhere the TextExtracts extension can consult. That seems like a lot of work.
We might want to consider allowing editors to mark up text extracts by providing start and end markers. That also seems like a lot of work.
Thus I would recommend we avoid using exsentences and use exchars instead, but I guess that's up to the people you pinged above. Maybe they think fixing this is worth the work?
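The exchars-plus-ellipsis alternative proposed in this thread is easy to sketch. The following is a hypothetical helper (`ExtractTruncator` is not a real TextExtracts class), assuming we cut at a character budget and back up to the last word boundary before appending the ellipsis:

```java
// Hypothetical sketch of the exchars-style fallback proposed above:
// cut the extract at a fixed character budget, back up to the last
// whitespace so no word is split, and append an ellipsis.
public class ExtractTruncator {
    public static String truncate(String extract, int maxChars) {
        if (extract.length() <= maxChars) {
            return extract; // already short enough, no ellipsis needed
        }
        String cut = extract.substring(0, maxChars);
        int lastSpace = cut.lastIndexOf(' ');
        if (lastSpace > 0) {
            cut = cut.substring(0, lastSpace); // avoid splitting a word
        }
        return cut.trim() + "\u2026"; // U+2026 HORIZONTAL ELLIPSIS
    }
}
```

Unlike sentence detection, this behaves identically in every language, which is the main appeal of the approach.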
On the apps, we do what @Jhernandez mentioned: we use four lines of text with an ellipsis at the end, which seems like a better solution than sentence detection given the language constraints. We haven't heard any complaints about this from readers so far.
Looking at Hovercards on beta, we can do eight lines with a picture and six lines without one.
Side note: the intention of Hovercards is to give an idea of what the topic is (in most cases, whether it's a person, place, or event, plus other short context), not to let readers actually learn deeply about the topic.
IIRC, in MediaWiki-extensions-Cards, we request the Wikidata description and the text extract, and only render the latter if the former isn't available.
@Jdlrobson, @Mholloway, @bearND, and I were discussing this and I asked if we could add some links here to aid in identifying potentially common logic. I'm interested in getting the treatment for cards consistent, both with a short term fix (e.g., just use the parts that are known to work already) and a medium term fix (get to one consistent treatment on the summary endpoint).
@Mholloway and @bearND, would you please confirm the Android client side treatment of extracts?
The iOS preview fetching code is below. @Mhurd, can you remind us what's the thing in the iOS app that spruces up the text extract so it looks nice in cards in the feed? I gather that the ellipsis is Cocoa provided, but I seem to recall it being cleaned up with methods (I think it's summary?) such as those in https://github.com/wikimedia/wikipedia-ios/blob/da2b7fadcb39356c5535eddd8926a2f9990a30de/Wikipedia/Code/NSString%2BWMFHTMLParsing.m.
Some history on the summary RESTbase endpoint that's used by newer Android clients is in https://phabricator.wikimedia.org/T117082, and the server config for that is expressed within this RESTbase endpoint: https://github.com/gwicke/restbase/tree/c0fb3f343b236dbe30c22eaff2c5b7369eeda32a/wikimedia/v1. CC @GWicke for more context.
If we're really interested in a service that provides exsentences-like functionality, then we might want to investigate some of the NLP packages available on npm – nlp_compromise looks interesting – with a view to creating a new, specialised RESTBase endpoint.
Correct me if I'm wrong on any of the above.
I'm moving this to sign-off and keeping it there until I receive some comments, or until the end of the sprint approaches and I create the follow-up plan tasks.
On Android, we request five sentences via exsentences as our raw content, either directly from the MW API[1] (for legacy/fallback API content loading) or via RESTBase's summary endpoint[2], then feed the result into Java's BreakIterator[3] class along with the Locale in order to parse two localized sentences that we ultimately include in the link preview.[4]
I'm having trouble putting my finger on a Phab task discussing this but my understanding is that @Dbrant did some testing and found that our additional client-side processing with BreakIterator produced better results than relying on exsentences alone.
[1] https://github.com/wikimedia/apps-android-wikipedia/blob/master/app/src/main/java/org/wikipedia/server/mwapi/MwPageService.java#L174-L176
[2] https://github.com/wikimedia/restbase/blob/04d442f34e77566e5abdc75ed35c798518b4dd08/v1/summary.yaml#L101
[3] https://developer.android.com/reference/java/text/BreakIterator.html
[4] https://github.com/wikimedia/apps-android-wikipedia/blob/master/app/src/main/java/org/wikipedia/page/linkpreview/LinkPreviewContents.java#L89-L116
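The client-side step described above can be sketched with the same stdlib class. This is a minimal illustration of the technique, not the actual LinkPreviewContents code (`SentenceClipper` is a hypothetical name):

```java
import java.text.BreakIterator;
import java.util.Locale;

// Minimal sketch of the Android approach described above: feed the raw
// extract into java.text.BreakIterator with the content Locale, and keep
// only the first n locale-aware sentences.
public class SentenceClipper {
    public static String firstSentences(String extract, Locale locale, int n) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(extract);
        it.first();
        int end = 0;
        for (int i = 0; i < n; i++) {
            int next = it.next();
            if (next == BreakIterator.DONE) {
                break; // fewer than n sentences; keep what we have
            }
            end = next;
        }
        return extract.substring(0, end).trim();
    }
}
```

Because BreakIterator uses locale-specific boundary rules rather than naive punctuation matching, it handles cases like abbreviations better than a regex split would.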
Taking a step back: since on Android we're really just requesting a blob of text that's hopefully big enough to parse two sentences from on the client side, we could just as easily use exchars as exsentences if we decide to sunset the latter.
Of course we'd also be happy to move sentence parsing logic off the client altogether if we can get sentence parsing of sufficient quality on the server side.
Yes, this is a prime RESTBase use case if the need for exsentences-like functionality is strong enough to justify the development/maintenance effort.
If we're not tied to PHP/JS, https://github.com/nltk/nltk seems to be the most sophisticated set of tools for this kind of thing.
Thanks for chiming in @Mholloway.
@dr0ptp4kt seems like moving all uses of exsentences to exchars and deprecating the param is a sensible option for all parties (Android + Web extensions). Thoughts product wise? I'll need to create a bunch of tasks to proceed with this properly (see implementation proposal in T135020#2300556).
@Jhernandez, the proposal makes sense, although I recommend you email mediawiki-api early. One question: @Mholloway, will Android Just Work (TM) once RESTbase is reconfigured to use exchars instead of exsentences?
@Jhernandez note iOS requests 525 characters https://github.com/wikimedia/wikipedia-ios/blob/39ac602194a37b9c23f3fddfccb92e4d52376e99/Wikipedia/Code/WMFNumberOfExtractCharacters.h.
In principle, it Should (TM). However, I was just digging around in older versions of our repo on GitHub, and apparently we did use exchars at one point, as I vaguely remembered yesterday, but:
Also, we no longer request "excharacters" from TextExtracts, since it has an issue where it's liable to return content that lies beyond the lead section, which might include unparsed wikitext, which we certainly don't want.
I'd be interested in hearing whether iOS has run into this and whether it's fixed or workaround-able.
Edit: related task: T101153: Link preview sometimes shows wikitext markup and "Edit"
Adding @dchan @KartikMistry @santhosh, as they have done work on sentence segmentation for ContentTranslation-CXserver: https://github.com/wikimedia/mediawiki-services-cxserver/tree/master/segmentation.
There could be opportunities for sharing expertise and maybe even code.
ContentTranslation-CXserver has sentence segmentation; it can segment DOM elements while producing well-balanced tags. Documentation is available at https://www.mediawiki.org/wiki/Content_translation/Developers/Segmentation
We have support for several languages, but languages like Thai are difficult. CX is designed to tolerate cases where we cannot segment a sentence (falling back to the parent block element).
I am not familiar with the use case you want to address, but to handle all languages it is always better to use a simple segmentation algorithm, like cxserver's, along with a safe fallback.
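The "simple algorithm plus safe fallback" pattern described above might look roughly like this, again using java.text.BreakIterator as the simple segmenter (a hypothetical sketch; cxserver's actual implementation is the linked JavaScript code):

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch of the "simple segmentation + safe fallback"
// pattern: try to find a sentence boundary; if the text yields no
// useful split, fall back to the whole block rather than emitting a
// broken fragment (analogous to cxserver's parent-block fallback).
public class SafeSegmenter {
    public static String firstSentenceOrBlock(String block, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(block);
        it.first();
        int boundary = it.next();
        // DONE, or a boundary at the very end, both mean "no useful
        // split": return the whole block instead of a fragment.
        if (boundary == BreakIterator.DONE || boundary >= block.length()) {
            return block;
        }
        return block.substring(0, boundary).trim();
    }
}
```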
@dr0ptp4kt @Jhernandez Upon further investigation it turns out that the T101153 problem can be worked around by setting the exintro param, which is what the iOS app is doing. So, yes, exchars should Just Work for Android.
> Sentence detection is completely broken and a blocker for hovercards stable
@Jhernandez: I don't see any evidence for this. There's a single blocking task (T115817) about a problem with lang templates on Russian Wikipedia. How does that constitute "completely broken"? I found 2 other bugs related to exsentences (T100845 and T59669), but both of them seem to actually be fixed now.
> Deprecate exsentences in TextExtracts api module
@Jhernandez: This seems ridiculous and dangerous to me. Ridiculous because I don't see any evidence that it is substantially broken. Dangerous because there are a lot of users of exsentences outside of MediaWiki that will be broken by this. Just because it doesn't work well enough for Hovercards doesn't mean it should be killed for all other uses. The perfect should not be the enemy of the good.
@GWicke @santhosh thanks for chiming in, it seems like an interesting thing to use in the restbase summary endpoint.
@Mholloway thanks for the note, I'll keep it in mind when making the migration task.
@kaldari Fair enough; there's no need to deprecate the param, only for the team's projects to stop using it. That makes sense.
Regarding blocking rollout, that's what I understand from the product owners driving the promotion and what they have talked with users of the initial target wikis. @JKatzWMF and @dr0ptp4kt should be able to provide more details.
Does this sound good? Anything else to take into account?
I'm planning to create follow-up tasks and resolve this one at the end of tomorrow, Friday (end of day Europe time), to match the end of the sprint, if nothing urgent or blocking comes up.
Yeah, #1 is the blocker for Russian Wikipedia Hovercards, which we'd want to A/B test before rolling out full force. Russian Wikipedia is next in line after Hungarian Wikipedia.