Add a link: sentence highlighting research spike
Open, MediumPublic

Description

NOTE: this is not slated for the initial release.

In user tests for the "add a link" workflow, we found some evidence that users are more likely to read the context of the sentence with the active link suggestion if the sentence is highlighted. This may cause them not to zoom in to just the word, but rather to consider whether it should be linked in the context of the whole sentence. See screenshot from prototype below.

In this research spike, we want to find out how we might be able to parse sentences out and highlight them.

image.png (370×759 px, 94 KB)

Event Timeline

MMiller_WMF renamed this task from Add a link: sentence highlighting to Add a link: sentence highlighting research spike.Dec 8 2020, 3:52 PM
MMiller_WMF updated the task description.

@MMiller_WMF do you still want to do some research on this? If so let's make a plan for when/how to do that.

@kostajh -- no, we are cutting this from the scope of the initial release. I will change the task description accordingly.

OK, I'm marking it declined for now then, it's still part of the tree of tasks and linked on the add-link board so we can find it later if we want.

MMiller_WMF added subscribers: pau, santhosh.

I learned from @pau that the Language team is able to highlight sentences for the purposes of mobile section translation. They are using an algorithm that @santhosh has created and polished over time, and that we might be able to reuse. @santhosh -- if you're able to post any information here, that would be really helpful. Otherwise, we can come ask you once the team starts to think about this task. Thank you!

Re-opening to discuss.

The idea @Tgr mentioned in T271124#6759657 following a comment from @MGerlach is to use the existing pipeline processing that happens in the link recommendation service to return the sentence context along with the rest of the link recommendation data -- the service already has some concept of what sentence the link phrase is in, so we could just attempt to use that.

I have a feeling it's not going to be that straightforward though.

If we took the T267694#6764244 approach to identifying the anchor, we would get this for free (assuming the performance of nltk.tokenize is acceptable) as the same technique could be used to convert sentence boundaries from wikitext to HTML.
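For illustration, a minimal sketch of the sentence lookup described above, assuming plain-text character offsets for the anchor. The helper names `punkt_spans` and `sentence_for_anchor` are hypothetical and not part of mwaddlink; the only nltk call used is `PunktSentenceTokenizer.span_tokenize`, here with an untrained tokenizer, which may split less accurately than a language-specific model. Mapping the resulting offsets from wikitext to HTML is not covered.

```python
from typing import Iterable, List, Optional, Tuple

Span = Tuple[int, int]


def punkt_spans(text: str) -> List[Span]:
    """Return (start, end) character offsets of sentences, as found by nltk's Punkt."""
    # Imported lazily so the lookup below has no hard dependency on nltk.
    from nltk.tokenize.punkt import PunktSentenceTokenizer

    # Assumption: an untrained tokenizer with default parameters is good enough
    # for a first experiment; a trained, per-language model may do better.
    return list(PunktSentenceTokenizer().span_tokenize(text))


def sentence_for_anchor(spans: Iterable[Span], start: int, end: int) -> Optional[Span]:
    """Return the sentence span that fully contains the anchor [start, end), if any."""
    for sent_start, sent_end in spans:
        if sent_start <= start and end <= sent_end:
            return (sent_start, sent_end)
    # The anchor straddles a sentence boundary or lies outside every span.
    return None
```

Calling `sentence_for_anchor(punkt_spans(text), start, end)` would give the offsets to highlight, or `None` when the anchor does not sit inside a single detected sentence, in which case the caller could fall back to highlighting the link phrase only.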

If that approach turns out to be difficult, @Tgr and I were discussing whether a low-tech version of highlighting would also provide some benefit -- highlight a couple of words before and after the phrase and don't worry about exact sentence boundaries. @RHo, is that something we should keep in mind as an option if sentence boundaries turn out to be too difficult to fit into our current timeline, or would that type of highlighting not be so helpful?

Until Santhosh is back, I gathered some details from @ngkountas and @Nikerabbit. They think that using the current Content Translation server (CX-cxserver) implementation as-is may be overkill for this case.

In our current approach, we send API requests to fetch page content segmented into content translation segments (these APIs are used for both Content and Section Translation), then transform these segments into Paragraph and Sentence models.

This requires a lot of extra DOM manipulation. To provide this “sentence-by-sentence” functionality (which, among other things, allows us to highlight sentences) we have to display the content to users ourselves, meaning we create divs and paragraphs based on the Paragraph and Sentence models mentioned above. In other words, we do not use MediaWiki’s mechanism to render section contents; we render it in our own way.

(The above is my attempt to summarize the different ideas Nik and Niklas brought to the conversation. If further details are needed, it may be better for the engineers on each team to discuss directly.)

Thanks @Pginer-WMF.

So, I think that means we have two options:

  1. Use naïve highlighting that extends a couple of words in either direction and pays no attention to sentence structure, e.g. if the link text is "fox" (bold), then the highlighting (italics) would look like "the quick brown fox jumps over the lazy dog".
  2. Use the nltk library, which the service already uses during querying, to get the contours of the sentence (wikitext offsets, plus before/after context like we do for the link text), and highlight based on that.

Option 1 is less work, so I'd propose we start there and make a task to try option 2 if we find we have enough time to pursue it.
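A rough sketch of what option 1 could look like, assuming the link phrase is given as character offsets into plain text. The function name `naive_context` and the `words` parameter are hypothetical, and splitting on whitespace is a simplification that would not work for languages written without spaces.

```python
import re
from typing import Tuple


def naive_context(text: str, start: int, end: int, words: int = 2) -> Tuple[int, int]:
    """Return offsets covering text[start:end] plus up to `words` words on each side."""
    # Offsets of every whitespace-delimited word in the text.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    # Indices of the words that overlap the link phrase.
    hit = [i for i, (s, e) in enumerate(spans) if s < end and e > start]
    if not hit:
        # Nothing overlaps (e.g. the offsets point at whitespace): highlight nothing extra.
        return start, end
    first = max(hit[0] - words, 0)
    last = min(hit[-1] + words, len(spans) - 1)
    return spans[first][0], spans[last][1]


text = "the quick brown fox jumps over the lazy dog"
start = text.index("fox")
ctx_start, ctx_end = naive_context(text, start, start + len("fox"))
print(text[ctx_start:ctx_end])  # prints "quick brown fox jumps over"
```

The window is clamped at the start and end of the text, so a link phrase near either edge simply gets less context on that side.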

Change 665998 had a related patch set uploaded (by Kosta Harlan; owner: Kosta Harlan):
[research/mwaddlink@main] [WIP] Return sentence context

https://gerrit.wikimedia.org/r/665998

kostajh moved this task from Inbox to Triaged on the Growth-Team board.
kostajh added a subscriber: KStoller-WMF.

cc @KStoller-WMF just so you know about this task; I'm not proposing we take it on anytime soon. But thought you'd be interested in it.

Change 665998 abandoned by Kosta Harlan:

[research/mwaddlink@main] [WIP] Return sentence context

Reason:

https://gerrit.wikimedia.org/r/665998

FYI @kostajh and @KStoller-WMF - Editing team are exploring this space again for the Edit check project - see T324363