
Define/enforce character limits
Closed, Resolved · Public · 1 Estimated Story Points

Description

Now that T145231: TextExtracts exception on very long repetitive content has been fixed, the TextExtracts API can respond to requests for large numbers of characters without falling over. This is great (!), but we should define and enforce limits for both characters and sentences in order to limit memory consumption.

AC

  • Define sensible limits for the number of characters/sentences that can be requested.
  • If the client request exceeds those limits, then:
    • Limit the extract, e.g. return 1050 characters of a 3000 character extract.
    • Add a warning to the output notifying the client of the action.

Strawman limits

Characters: 1050 – Page Previews, for example, requests 525.
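
To make the acceptance criteria concrete, here is a rough sketch of a request that exceeds the strawman limit. The exchars parameter is the real TextExtracts parameter; the warning wording and response shape below are illustrative assumptions, not the final output.

  # Client asks for 3000 characters; the API would cap the extract at the
  # limit and attach a warning rather than returning an error.
  https://en.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&exchars=3000&titles=Earth&format=json

  {
    "warnings": {
      "extracts": {
        "*": "exchars exceeds the maximum allowed value; the extract was truncated."
      }
    },
    "query": {
      "pages": {
        "9228": {
          "pageid": 9228,
          "title": "Earth",
          "extract": "Earth is the third planet from the Sun..."
        }
      }
    }
  }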

Event Timeline

phuedx triaged this task as Medium priority. Jan 27 2017, 10:24 AM
phuedx moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.

Ping @Dbrant, @Mholloway, @Fjalapeno. I guess my question is "When is an extract not an extract?"

Mholloway updated the task description.

Those limits look reasonable to me.

These limits also seem fine to me. As long as we satisfy the RESTBase requirements and the other existing use cases, this is an easy question. (If we propose limits that would change the output for any of them, then we need to have a bigger conversation.)

One comment:

Make the API respond with an error if the client request exceeds those limits.

I would respond with the extract limited to the maximum AND a warning that they requested more than the maximum - But not an error.

I would respond with the extract limited to the maximum AND a warning that they requested more than the maximum - But not an error.

👍

Note that memory consumption is not an issue here:

  • Character-level extraction doesn't involve much allocation.
  • Sentence-level extraction involves a significant amount of allocation; however, it depends on the number of sentences in the extract, not the number requested. And in both HHVM and modern PHP the overhead is not outrageous.

I would rather see the restrictions as a way to educate people: most users requesting an insane number of characters/sentences are doing so because they haven't read the documentation and are using it as a way to request everything.

Looks like we're currently limiting the maximum number of sentences to 10: https://github.com/wikimedia/mediawiki-extensions-TextExtracts/blob/master/includes/ApiQueryExtracts.php#L367

So the only thing left to do is to limit the number of characters.
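
For the character side, here is a minimal standalone sketch of the clamp-and-warn behaviour discussed above. The constant, function name, and warning text are hypothetical, chosen for illustration; this is not the merged patch.

  <?php
  // Hypothetical ceiling, matching the 1200-character figure adopted later in this task.
  const MAX_EXCHARS = 1200;

  /**
   * Clamp a requested character count to the ceiling and queue a warning
   * instead of failing the request, per the discussion above.
   */
  function clampExchars( int $requested, array &$warnings ): int {
      if ( $requested > MAX_EXCHARS ) {
          $warnings[] = "exchars may not be over " . MAX_EXCHARS . "; the extract was truncated.";
          return MAX_EXCHARS;
      }
      return $requested;
  }

  $warnings = [];
  $chars = clampExchars( 3000, $warnings ); // 1200, with one warning queued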

pmiazga renamed this task from Define/enforce character and sentence limits to Define/enforce character limits. Apr 11 2017, 3:33 PM
pmiazga updated the task description.

While working on the task I found some unknowns and would like to get some opinions on how to move forward.

  1. Should the character limit apply to the text version of the extract or the HTML version too?
  2. If the limit is also applied to the HTML version, we may end up with malformed HTML, something like this: "<p>1\n</p>\n\n<p>When writing systems were created in ancient civilizations, a variety of objects, such..."
  3. The code has a mechanism to fix the issue in #2 that works if $wgUseTidy = true, but that code has its own problems.
    • $wgUseTidy is deprecated.
    • Even after trying to tidy using that mechanism, the output still has some problems: "<p>1\n</p>\n\n<p>When writing systems were created in ancient civilizations, a variety of objects, such\n<!-- Tidy found serious XHTML errors -->...". Notice how the second <p> is not closed, and how we're shipping extra debug information.

Given the above, what should our course of action be? Should we apply the limit to the text version and create a task to not depend on $wgUseTidy and then apply the limit to the new HTML version? Any other suggestions?

While working on the task I found some unknowns and would like to get some opinions on how to move forward.

  1. Should the character limit apply to the text version of the extract or the HTML version too?

Given the low estimate, I would suggest you just limit the text version of the extract. I believe @phuedx had that in mind when proposing the task (but he can correct me if I'm wrong). A new task can be created for HTML if necessary; as you point out, there's lots of pre-work to do to make that actionable.

Given the above, what should our course of action be? Should we apply the limit to the text version and create a task to not depend on $wgUseTidy and then apply the limit to the new HTML version? Any other suggegstions?

#2 and #3 are pre-existing issues AFAICT. One way forward would be to set the limit a little higher than we first thought, to accommodate the need to tweak the response to make it valid HTML, and to write up a bug/task to update TextExtracts to use RemexHtml.
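
For reference, re-serializing a truncated extract with RemexHtml would look roughly like the pipeline below. This is a sketch based on RemexHtml's documented tokenizer → tree builder → serializer flow; exact class names and namespaces vary between versions, so treat it as an assumption rather than the eventual patch.

  <?php
  use RemexHtml\Tokenizer\Tokenizer;
  use RemexHtml\TreeBuilder\TreeBuilder;
  use RemexHtml\TreeBuilder\Dispatcher;
  use RemexHtml\Serializer\Serializer;
  use RemexHtml\Serializer\HtmlFormatter;

  // A truncated extract with an unclosed <p>, as in the example above.
  $truncated = "<p>1\n</p>\n\n<p>When writing systems were created in "
      . "ancient civilizations, a variety of objects, such";

  // Re-parse and re-serialize so the markup is balanced again. The result
  // is a full document; the balanced fragment sits inside its <body>.
  $serializer  = new Serializer( new HtmlFormatter() );
  $treeBuilder = new TreeBuilder( $serializer );
  $dispatcher  = new Dispatcher( $treeBuilder );
  $tokenizer   = new Tokenizer( $dispatcher, $truncated );
  $tokenizer->execute();

  echo $serializer->getResult();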

OK, thanks for the input, both. I'll increase the limit to 1200 characters and apply it to both the text and HTML versions. I'll create a task for the issues I brought up.

Change 355559 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/TextExtracts@master] API: Limit maximum number of characters when exchars is passed.

https://gerrit.wikimedia.org/r/355559

Change 355559 merged by jenkins-bot:
[mediawiki/extensions/TextExtracts@master] API: Limit maximum number of characters when exchars is passed.

https://gerrit.wikimedia.org/r/355559

Jdforrester-WMF subscribed.

Mass-moving all items tagged for MediaWiki 1.30.0-wmf.3, as that was never released; instead, we're using -wmf.4.