
Spike: Alternative to TextExtracts for Popups, Gather, Read more
Closed, Resolved · Public

Description

TextExtracts has various bugs filed against it, and the apps use it.

Questions to answer:

  • What is a summary of the existing bugs?
    • What categories do they fit under?
    • What are the common pain points?
  • What would we gain by using a parsoid/rest based service?
  • Would it be better to just fix the existing TextExtracts bugs or to create a dedicated service?

Duration: 8hrs

Event Timeline

Jdlrobson raised the priority of this task to Needs Triage.
Jdlrobson updated the task description.
Jdlrobson added a subscriber: Jdlrobson.
Restricted Application added a subscriber: Aklapper. · Sep 24 2015, 5:38 PM
Jdlrobson updated the task description. · Sep 24 2015, 5:57 PM
Jdlrobson set Security to None.
Jdlrobson renamed this task from Spike: Alternative to TextExtracts for Popups to Spike: Alternative to TextExtracts for Popups, Gather, Read more. · Sep 24 2015, 6:00 PM
Jdlrobson updated the task description. · Sep 24 2015, 10:02 PM
Jdlrobson added a subscriber: phuedx.

@phuedx could you weigh in with your thoughts on this?

Jdlrobson triaged this task as Medium priority. · Sep 24 2015, 10:42 PM
phuedx claimed this task. · Oct 5 2015, 12:55 PM
phuedx moved this task from To Do to Doing on the Reading-Web-Sprint-57-The Fifth Element board.
phuedx added a comment. · Oct 5 2015, 3:19 PM

What is a summary of the existing bugs?

The most common pain points of the TextExtracts extension seem to be:

  • How it handles sentences
  • How it does/doesn't strip content from the plain text before extraction

At a guess, I'd say that most of the bugs that fall into the latter category could be dealt with quite quickly.

What would we gain by using a parsoid/rest based service?

The Text extraction RFC notes that extract storage – and, presumably, invalidation of that storage – is still an issue, which isn't the case with a RESTBase-proxied service. Other benefits include versioning (!) and automatically gathered performance metrics to name a few.

As I've noted elsewhere, the content translation service does multi-language sentence boundary detection and is proxied by RESTBase. Here though, we also reap the reward of using a service that is being actively developed.

Would it be better to just fix the existing TextExtracts bugs or to create a dedicated service?

This is quite a broad question and, consequently, my answer can only be "It depends." Relying on a Node.js service makes this feature harder to deploy for third-party wikis, though that decision has clearly already been weighed by the ContentTranslation-CXserver team. On the other hand, I'm not sure if there's even a "current state" of natural language processing in PHP. I think the sweet spot is somewhere in the middle: proxy a fixed-up TextExtracts API with RESTBase.

What is a summary of the existing bugs?

The most common pain points of the TextExtracts extension seem to be:

  • How it handles sentences
  • How it does/doesn't strip content from the plain text before extraction

At a guess, I'd say that most of the bugs that fall into the latter category could be dealt with quite quickly.

What about the former case? Can you elaborate on it?

What would we gain by using a parsoid/rest based service?

The Text extraction RFC notes that extract storage – and, presumably, invalidation of that storage – is still an issue, which isn't the case with a RESTBase-proxied service. Other benefits include versioning (!) and automatically gathered performance metrics to name a few.
As I've noted elsewhere, the content translation service does multi-language sentence boundary detection and is proxied by RESTBase. Here though, we also reap the reward of using a service that is being actively developed.

Would it be better to just fix the existing TextExtracts bugs or to create a dedicated service?

This is quite a broad question and, consequently, my answer can only be "It depends." Relying on a Node.js service makes this feature harder to deploy for third-party wikis, though that decision has clearly already been weighed by the ContentTranslation-CXserver team. On the other hand, I'm not sure if there's even a "current state" of natural language processing in PHP. I think the sweet spot is somewhere in the middle: proxy a fixed-up TextExtracts API with RESTBase.

That sounds like a good idea. So TextExtracts would make use of a RESTBase service if one is available?

Would it make sense next sprint to address the TextExtracts bugs and then re-evaluate where we are?

phuedx added a comment. · Oct 6 2015, 9:52 AM

What about the former case? Can you elaborate on it?

T59669 contains a lot of examples. I mentioned this case first because I had a go at fixing that bug a while back and some time later started reading about the complexities of sentence boundary detection (SBD).
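
To illustrate why sentence handling is hard, here is a minimal sketch of a naive splitter. It is not the extension's actual implementation; the function name and sample text are made up for illustration, and the rule (split after `.`, `!` or `?` followed by whitespace) is only an assumption about what a simple approach might look like.

```python
import re


def naive_first_sentences(text: str, count: int) -> str:
    """Return the first `count` sentences using a naive punctuation rule.

    Hypothetical helper: splits on whitespace that follows '.', '!' or '?'.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(parts[:count])


sample = "Dr. Jane Goodall studied chimpanzees in Tanzania. She began her work in 1960."

# The abbreviation "Dr." is mistaken for a sentence end, so the "extract"
# is cut off after the first token.
print(naive_first_sentences(sample, 1))  # prints "Dr."
```

Abbreviations, initials and decimal numbers all trip up a rule like this, and the set of exceptions differs per language, which is why multi-language SBD is usually treated as a problem of its own.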

That sounds like a good idea. So TextExtracts would make use of a RESTBase service if one is available?

Other way around. We augment the TextExtracts API by proxying it through RESTBase and use that API when we know we can (read: use a configuration variable with a sensible default).
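
A rough sketch of the configuration switch described above, from a client's point of view. Everything named here is an assumption for illustration: the `use_restbase` flag, the `/w/api.php` script path and especially the RESTBase route, which the thread does not specify. The `prop=extracts` parameters are the ones the TextExtracts API exposes. Defaulting the flag to off is one possible reading of "sensible default", chosen so that wikis without RESTBase keep working.

```python
from urllib.parse import quote, urlencode

# Hypothetical route: the thread does not name a RESTBase path for extracts.
RESTBASE_EXTRACT_PATH = "/api/rest_v1/page/extract/"
# Assumed script path; individual wikis may configure this differently.
ACTION_API_PATH = "/w/api.php"


def build_extract_url(base_url: str, title: str, sentences: int = 2,
                      use_restbase: bool = False) -> str:
    """Build the URL a client would request for a plain-text intro extract."""
    if use_restbase:
        # Proxied variant: the title becomes a single encoded path segment.
        return base_url + RESTBASE_EXTRACT_PATH + quote(title, safe="")
    # Direct variant: query the TextExtracts module of the action API.
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "exsentences": sentences,
        "titles": title,
        "format": "json",
    }
    return base_url + ACTION_API_PATH + "?" + urlencode(params)


print(build_extract_url("https://wiki.example.org", "Albert Einstein"))
print(build_extract_url("https://wiki.example.org", "Albert Einstein", use_restbase=True))
```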

Would it make sense next sprint to address the TextExtracts bugs and then re-evaluate where we are?

Yes.

@bmansurov @Jhernandez any questions before I sign this off?

Jdlrobson added a subscriber: jhobs. · Edited · Oct 7 2015, 4:56 PM

@Jhernandez! Also @jhobs, any questions before I sign this off? :-)

Seems good.

@Jdlrobson @phuedx Should we talk to the services team about:

Other way around. We augment the TextExtracts API by proxying it through RESTBase and use that API when we know we can (read: use a configuration variable with a sensible default).

If we're going to route all that traffic there, we'd better make sure it's a good idea first.

(Just for my own clarification, as I understood it: access would be through RESTBase from the client side. RESTBase would proxy and cache requests to an augmented TextExtracts API.)
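
The flow described in that clarification can be sketched as a toy proxy. This is only an illustration of the request path (client, then proxy with cache, then backend), not how RESTBase stores or invalidates content; the class, its in-memory dictionary and the stand-in backend function are all hypothetical.

```python
from typing import Callable, Dict


class CachingExtractProxy:
    """Toy stand-in for a caching proxy in front of an extracts backend."""

    def __init__(self, fetch_extract: Callable[[str], str]) -> None:
        self._fetch = fetch_extract        # hypothetical backend call
        self._cache: Dict[str, str] = {}   # title -> cached extract
        self.backend_calls = 0             # counts cache misses

    def get(self, title: str) -> str:
        """Serve from the cache, falling back to the backend on a miss."""
        if title not in self._cache:
            self.backend_calls += 1
            self._cache[title] = self._fetch(title)
        return self._cache[title]

    def invalidate(self, title: str) -> None:
        """Drop a cached extract, for example after the page is edited."""
        self._cache.pop(title, None)


proxy = CachingExtractProxy(lambda title: f"Extract of {title}")
proxy.get("Paris")
proxy.get("Paris")          # second request is served from the cache
print(proxy.backend_calls)  # prints 1
```

The point of the sketch is that repeated client requests only reach the augmented API on a miss or after invalidation, which is where the concern about routing "all that traffic" would be decided.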

Covered all of my questions!

Have pushed the following tasks into next sprint:
T73023 T109869 T74629 T112137 T109867
Let me know if any of them do not seem actionable @phuedx!

Jdlrobson closed this task as Resolved. · Oct 9 2015, 8:15 PM

All of 'em look good @Jdlrobson. They all look like they can be fixed in one place too: the ExtractFormatter class.