
Define/enforce character limits
Closed, Resolved · Public · 1 Estimated Story Points

Description

Now that T145231: TextExtracts exception on very long repetitive content has been fixed, the TextExtracts API can respond to requests for large numbers of characters without falling over. This is great (!), but we should define and enforce limits for both characters and sentences in order to limit memory consumption.

AC

  • Define sensible limits for the number of characters/sentences that can be requested.
  • If the client request exceeds those limits, then:
    • Limit the extract, e.g. return 1050 characters of a 3000 character extract.
    • Add a warning to the output notifying the client of the action.

Strawman limits

Characters: 1050 – Page Previews, for example, requests 525.
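
To make the acceptance criteria concrete, here is a rough sketch of a request that exceeds the strawman limit. The exchars parameter is the real TextExtracts parameter; the warning wording and response shape below are illustrative assumptions, not the final output.

  # Client asks for 3000 characters; the API would cap the extract at the
  # limit and attach a warning rather than returning an error.
  https://en.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=1&exchars=3000&titles=Earth&format=json

  {
    "warnings": {
      "extracts": {
        "*": "exchars exceeds the maximum allowed value; the extract was truncated."
      }
    },
    "query": {
      "pages": {
        "9228": {
          "pageid": 9228,
          "title": "Earth",
          "extract": "Earth is the third planet from the Sun..."
        }
      }
    }
  }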

Event Timeline

phuedx triaged this task as Medium priority. Jan 27 2017, 10:24 AM
phuedx moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.

Ping @Dbrant, @Mholloway, @Fjalapeno. I guess my question is "When is an extract not an extract?"

Mholloway updated the task description.

Those limits look reasonable to me.

These limits also seem fine to me. As long as we satisfy the RESTBase requirements and the other existing use cases, this is an easy question. (If we propose limits that would change the output for any of them, then we need to have a bigger conversation.)

One comment:

Make the API respond with an error if the client request exceeds those limits.

I would respond with the extract limited to the maximum AND a warning that they requested more than the maximum - But not an error.

I would respond with the extract limited to the maximum AND a warning that they requested more than the maximum - But not an error.

👍

Note that memory consumption is not an issue here:

  • Character-level extraction doesn't involve much allocation.
  • Sentence-level extraction involves a significant amount of allocation; however, it depends on the number of sentences in the extract, not the number requested. And in both HHVM and modern PHP the overhead is not outrageous.

I would rather see the restrictions as a way to educate people: most users requesting an insane number of characters/sentences are doing so because they haven't read the documentation and are using it as a way to request everything.

Looks like we're currently limiting the maximum number of sentences to 10: https://github.com/wikimedia/mediawiki-extensions-TextExtracts/blob/master/includes/ApiQueryExtracts.php#L367

So the only thing left to do is to limit the number of characters.
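
For the character side, here is a minimal standalone sketch of the clamp-and-warn behaviour discussed above. The constant, function name, and warning text are hypothetical, chosen for illustration; this is not the merged patch.

  <?php
  // Hypothetical ceiling, matching the 1200-character figure adopted later in this task.
  const MAX_EXCHARS = 1200;

  /**
   * Clamp a requested character count to the ceiling and queue a warning
   * instead of failing the request, per the discussion above.
   */
  function clampExchars( int $requested, array &$warnings ): int {
      if ( $requested > MAX_EXCHARS ) {
          $warnings[] = "exchars may not be over " . MAX_EXCHARS . "; the extract was truncated.";
          return MAX_EXCHARS;
      }
      return $requested;
  }

  $warnings = [];
  $chars = clampExchars( 3000, $warnings ); // 1200, with one warning queued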

pmiazga renamed this task from Define/enforce character and sentence limits to Define/enforce character limits. Apr 11 2017, 3:33 PM
pmiazga updated the task description.

While working on the task I found some unknowns and would like to get some opinions on how to move forward.

  1. Should the character limit apply to the text version of the extract or the HTML version too?
  2. If the limit is also applied to the HTML version, we may end up with malformed HTML, something like this: "<p>1\n</p>\n\n<p>When writing systems were created in ancient civilizations, a variety of objects, such..."
  3. The code has a mechanism to fix the issue in #2 that works if $wgUseTidy = true, but that code has its own problems.
    • $wgUseTidy is deprecated.
    • Even after trying to tidy using that mechanism, the output still has some problems: "<p>1\n</p>\n\n<p>When writing systems were created in ancient civilizations, a variety of objects, such\n<!-- Tidy found serious XHTML errors -->...". Notice how the second <p> is not closed, and how we're shipping extra debug information.

Given the above, what should our course of action be? Should we apply the limit to the text version and create a task to not depend on $wgUseTidy and then apply the limit to the new HTML version? Any other suggestions?

While working on the task I found some unknowns and would like to get some opinions on how to move forward.

  1. Should the character limit apply to the text version of the extract or the HTML version too?

Given the low estimate, I would suggest you just limit the text version of the extract. I believe @phuedx had that in mind when proposing the task (but he can correct me if I'm wrong). A new task can be created for HTML if necessary; as you point out, there's lots of pre-work to do to make that actionable.

Given the above, what should our course of action be? Should we apply the limit to the text version and create a task to not depend on $wgUseTidy and then apply the limit to the new HTML version? Any other suggegstions?

#2 and #3 are pre-existing issues AFAICT. One way forward would be to set the limit a little higher than we first thought, to accommodate the need to tweak the response to make it valid HTML, and to write up a bug/task to update TextExtracts to use RemexHtml.
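
For reference, re-serializing a truncated extract with RemexHtml would look roughly like the pipeline below. This is a sketch based on RemexHtml's documented tokenizer → tree builder → serializer flow; exact class names and namespaces vary between versions, so treat it as an assumption rather than the eventual patch.

  <?php
  use RemexHtml\Tokenizer\Tokenizer;
  use RemexHtml\TreeBuilder\TreeBuilder;
  use RemexHtml\TreeBuilder\Dispatcher;
  use RemexHtml\Serializer\Serializer;
  use RemexHtml\Serializer\HtmlFormatter;

  // A truncated extract with an unclosed <p>, as in the example above.
  $truncated = "<p>1\n</p>\n\n<p>When writing systems were created in "
      . "ancient civilizations, a variety of objects, such";

  // Re-parse and re-serialize so the markup is balanced again. The result
  // is a full document; the balanced fragment sits inside its <body>.
  $serializer  = new Serializer( new HtmlFormatter() );
  $treeBuilder = new TreeBuilder( $serializer );
  $dispatcher  = new Dispatcher( $treeBuilder );
  $tokenizer   = new Tokenizer( $dispatcher, $truncated );
  $tokenizer->execute();

  echo $serializer->getResult();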

OK, thanks for the input, both. I'll increase the limit to 1200 characters and apply it to both the text and HTML versions. I'll create a task for the issues I brought up.

Change 355559 had a related patch set uploaded (by Bmansurov; owner: Bmansurov):
[mediawiki/extensions/TextExtracts@master] API: Limit maximum number of characters when exchars is passed.

https://gerrit.wikimedia.org/r/355559

Change 355559 merged by jenkins-bot:
[mediawiki/extensions/TextExtracts@master] API: Limit maximum number of characters when exchars is passed.

https://gerrit.wikimedia.org/r/355559

Jdforrester-WMF subscribed.

Mass-moving all items tagged for MediaWiki 1.30.0-wmf.3, as that was never released; instead, we're using -wmf.4.