Spike: Alternative to TextExtracts for Popups, Gather, Read more
Closed, ResolvedPublic

Assigned To

Authored By

	Jdlrobson
	Sep 24 2015, 5:38 PM

Description

TextExtracts has various bugs filed against it, and the apps use it.

Questions to answer:

What is a summary of the existing bugs?
- What categories do they fit under?
- What are the common pain points?
What would we gain by using a parsoid/rest based service?
Would it be better to just fix the existing TextExtracts bugs or to create a dedicated service?

Duration: 8hrs

Related Objects

Status	Assigned	Task
Resolved	None	T169242 Develop Page Content Service for Reading Clients
Resolved	None	T177425 Develop General Layer of PCS
Resolved	Jhernandez	T177426 Develop structured JSON APIs for general consumption
Resolved	Mholloway	T177431 Develop a Summary JSON API
Resolved	Dereckson	T68374 Enable Hovercards on se.wikimedia.org (Swedish chapter wiki)
Resolved	Jdlrobson	T70860 [GOAL] Graduate Page Previews feature (Popups extension) out of Beta Feature
Resolved	ovasileva	T154635 [EPIC] Deploy page previews to English and German Wikipedia
Resolved	ovasileva	T192622 [EPIC] Page previews post-deploy cleanup
Resolved	Jdlrobson	T173952 Remove A/B testing instrumentation code
Duplicate	None	T167433 Switch all projects to the new (and yet to be built) summary-html endpoint for page previews
Duplicate	None	T167429 Make enwiki and dewiki fetch previews from the summary-html RESTBase endpoint
Resolved	ovasileva	T165018 Page previews can consume new summary-HTML endpoint
Declined	Jdlrobson	T111329 [GOAL] Page previews on mobileweb
Resolved	Jdlrobson	T164010 [EPIC] Strengthen the APIs we provide in reading web maintained extensions
Resolved	ovasileva	T113094 [EPIC] The Page Summary API needs to provide useful content for the majority of articles
Resolved	phuedx	T113633 Spike: Alternative to TextExtracts for Popups, Gather, Read more

Event Timeline

Jdlrobson created this task. Sep 24 2015, 5:38 PM

Jdlrobson raised the priority of this task to Needs Triage.

Jdlrobson updated the task description.

Jdlrobson added projects: Reading-Web-Sprint-57-The Fifth Element, Page-Previews.

Jdlrobson moved this task to Needs Analysis on the Reading-Web-Sprint-57-The Fifth Element board.

Jdlrobson subscribed.

Restricted Application added a subscriber: Aklapper. Sep 24 2015, 5:38 PM

Jdlrobson updated the task description. Sep 24 2015, 5:57 PM

Jdlrobson set Security to None.

Jdlrobson renamed this task from "Spike: Alternative to TextExtracts for Popups" to "Spike: Alternative to TextExtracts for Popups, Gather, Read more". Sep 24 2015, 6:00 PM

Jdlrobson added a parent task: T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles. Sep 24 2015, 10:00 PM

@phuedx could you weigh in with your thoughts on this?

Jdlrobson triaged this task as Medium priority. Sep 24 2015, 10:42 PM

Jdlrobson moved this task from Needs Analysis to To Do on the Reading-Web-Sprint-57-The Fifth Element board. Sep 29 2015, 4:42 PM

phuedx claimed this task. Oct 5 2015, 12:55 PM

phuedx moved this task from To Do to Doing on the Reading-Web-Sprint-57-The Fifth Element board.

What is a summary of the existing bugs?

The most common pain points of the TextExtracts extension seem to be:

- How it handles sentences
- How it does/doesn't strip content from the plain text before extraction

At a guess, I'd say that most of the bugs that fall into the latter category could be dealt with quite quickly.

What would we gain by using a parsoid/rest based service?

The Text extraction RFC notes that extract storage – and, presumably, invalidation of that storage – is still an issue, which isn't the case with a RESTBase-proxied service. Other benefits include versioning (!) and automatically gathered performance metrics to name a few.

As I've noted elsewhere, the content translation service does multi-language sentence boundary detection and is proxied by RESTBase. Here though, we also reap the reward of using a service that is being actively developed.

Would it be better to just fix the existing TextExtracts bugs or to create a dedicated service?

This is quite a broad question and, consequently, my answer can only be "It depends." Relying on a Node.js service makes this feature harder to deploy for third-party wikis – though that decision has clearly already been weighed by the ContentTranslation-CXserver team. On the other hand, I'm not sure if there's even a "current state" of natural language processing in PHP. I think the sweet spot is somewhere in the middle: proxy a fixed-up TextExtracts API with RESTBase.

phuedx moved this task from Doing to Code Review on the Reading-Web-Sprint-57-The Fifth Element board. Oct 5 2015, 3:19 PM

In T113633#1702146, @phuedx wrote:

What is a summary of the existing bugs?

The most common pain points of the TextExtracts extension seem to be:

- How it handles sentences
- How it does/doesn't strip content from the plain text before extraction

At a guess, I'd say that most of the bugs that fall into the latter category could be dealt with quite quickly.

What about the former case? Can you elaborate on it?

What would we gain by using a parsoid/rest based service?

The Text extraction RFC notes that extract storage – and, presumably, invalidation of that storage – is still an issue, which isn't the case with a RESTBase-proxied service. Other benefits include versioning (!) and automatically gathered performance metrics to name a few.

As I've noted elsewhere, the content translation service does multi-language sentence boundary detection and is proxied by RESTBase. Here though, we also reap the reward of using a service that is being actively developed.

Would it be better to just fix the existing TextExtracts bugs or to create a dedicated service?

This is quite a broad question and, consequently, my answer can only be "It depends." Relying on a Node.js service makes this feature harder to deploy for third-party wikis – though that decision has clearly already been weighed by the ContentTranslation-CXserver team. On the other hand, I'm not sure if there's even a "current state" of natural language processing in PHP. I think the sweet spot is somewhere in the middle: proxy a fixed-up TextExtracts API with RESTBase.

That sounds like a good idea. So TextExtracts would make use of RESTBase service if available?

Would it make sense next sprint to address the TextExtracts bugs and then re-evaluate where we are?

What about the former case? Can you elaborate on it?

T59669 contains a lot of examples. I mentioned this case first because I had a go at fixing that bug a while back and, some time later, started reading about the complexities of sentence boundary detection (SBD).
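To make the sentence-handling problem concrete, here is a minimal sketch of why naive splitting goes wrong. The sample text and the `naive_sentences` helper are illustrative assumptions, not taken from T59669 or from the TextExtracts code.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    Deliberately naive (hypothetical helper): it has no notion of
    abbreviations, so it over-splits.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Illustrative sample: two real sentences containing abbreviations.
sample = "Dr. Smith was born in St. Louis. He moved to the U.S. in 1990."

for fragment in naive_sentences(sample):
    print(repr(fragment))
```

The sample contains two sentences, but the splitter returns more fragments than that because it treats "Dr.", "St." and "U.S." as sentence ends. An extract limited to "the first sentence" would then be just "Dr.".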

That sounds like a good idea. So TextExtracts would make use of RESTBase service if available?

Other way around. We augment the TextExtracts API by proxying it through RESTBase and use that API when we know we can (read: use a configuration variable with a sensible default).
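A rough sketch of the "configuration variable with a sensible default" idea, from the client's point of view. The flag name `use_restbase`, the helper name, and the default host are hypothetical. The thread does not name a RESTBase route, so the `/api/rest_v1/page/summary/` path below is only a stand-in. The query parameters on the fallback branch are standard TextExtracts API parameters.

```python
from urllib.parse import quote, urlencode


def build_extract_url(title, use_restbase=False, host="en.wikipedia.org"):
    """Return the URL a client would request for a page extract.

    use_restbase is a hypothetical configuration flag; it defaults to
    False so wikis without RESTBase keep using the action API.
    """
    if use_restbase:
        # Assumed route, for illustration only.
        page = quote(title.replace(" ", "_"), safe="")
        return f"https://{host}/api/rest_v1/page/summary/{page}"

    # Direct TextExtracts request through the MediaWiki action API.
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "exintro": 1,
        "explaintext": 1,
        "exsentences": 2,
        "titles": title,
    }
    return f"https://{host}/w/api.php?{urlencode(params)}"


print(build_extract_url("Sentence boundary disambiguation"))
print(build_extract_url("Sentence boundary disambiguation", use_restbase=True))
```

Defaulting the flag to off keeps third-party wikis working without a Node.js service, which was the deployment concern raised earlier.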

Would it make sense next sprint to address the TextExtracts bugs and then re-evaluate where we are?

Yes.

@bmansurov @Jhernandez any questions before I sign this off?

Jdlrobson moved this task from Code Review to Ready for Signoff on the Reading-Web-Sprint-57-The Fifth Element board. Oct 6 2015, 9:57 PM

no questions

@Jhernandez! Also @jhobs, any questions before I sign this off? :-)

Seems good.

@Jdlrobson @phuedx Should we talk to the services team about:

Other way around. We augment the TextExtracts API by proxying it through RESTBase and use that API when we know we can (read: use a configuration variable with a sensible default).

If we're going to route all that traffic there, we'd better make sure it's a good idea first.

(Just for my own clarification, as I understood it: clients would access the API through RESTBase, and RESTBase would proxy requests to, and cache responses from, an augmented TextExtracts API.)
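The flow described above (client, then a proxy that caches, then the extracts backend) can be sketched as a toy in-memory model. This is not how RESTBase is implemented; the class name, the `backend` callable, and the `invalidate` hook are all hypothetical.

```python
from typing import Callable, Dict


class CachingExtractProxy:
    """Toy stand-in for a caching proxy in front of an extracts API."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self._backend = backend            # e.g. a call into TextExtracts
        self._cache: Dict[str, str] = {}   # title -> stored extract
        self.backend_calls = 0             # counts how often the backend is hit

    def get(self, title: str) -> str:
        # Serve from storage when possible; otherwise ask the backend once.
        if title not in self._cache:
            self.backend_calls += 1
            self._cache[title] = self._backend(title)
        return self._cache[title]

    def invalidate(self, title: str) -> None:
        # Would be triggered when the page is edited.
        self._cache.pop(title, None)


proxy = CachingExtractProxy(lambda title: f"extract of {title}")
proxy.get("Moon")
proxy.get("Moon")          # second read is served from the cache
proxy.invalidate("Moon")   # simulate an edit
proxy.get("Moon")          # backend is consulted again
print(proxy.backend_calls)
```

The point of the sketch is that repeated reads do not reach the backend, so the traffic concern is mostly about cache misses and invalidations.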

Covered all of my questions!

@Jhernandez: Agreed.

SYN @GWicke!

Have pushed the following tasks into next sprint:
T73023 T109869 T74629 T112137 T109867
Let me know if any of them do not seem actionable @phuedx!

Jdlrobson closed this task as Resolved. Oct 9 2015, 8:15 PM

Jdlrobson moved this task from Ready for Signoff to Done on the Reading-Web-Sprint-57-The Fifth Element board.

All of 'em look good @Jdlrobson. They all look like they can be fixed in one place too: the ExtractFormatter class.

GWicke added a subscriber: Pchelolo. Oct 16 2015, 8:36 PM
