
Blacklist automatic updates for especially expensive pages
Closed, Resolved · Public

Description

Some huge pages edited by bots use a disproportionate amount of resources in Parsoid and RESTBase. Examples are http://it.wikipedia.org/wiki/Utente:Biobot/log and https://ur.wikipedia.org/wiki/%D9%86%D8%A7%D9%85_%D9%85%D9%82%D8%A7%D9%85%D8%A7%D8%AA_%DA%A9%DB%92. Both are huge pages that are edited very frequently by bots, and both are very unlikely to be edited in VE.

In Parsoid, such huge pages (>2 MB of wikitext) cause high load and timeouts.

In RESTBase, those pages create wide rows (see T94121) and use a lot of storage space. For example, http://it.wikipedia.org/wiki/Utente:Biobot/log alone uses roughly 30 GB of storage per month. T94121 discusses options for handling such cases more efficiently, but those will take some time to implement.

To reduce this unnecessary resource consumption in the meantime, I propose adding an update blacklist to RESTBase, blocking automatic re-renders for these pages from the job queue. A dozen or two pages on this list would save a lot of resources in Parsoid and RESTBase until magic solutions for dealing with this abuse have arrived.
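The task doesn't spell out the implementation; below is a minimal sketch of what such a check might look like, written in TypeScript. All names here (`updateBlacklist`, `shouldRerender`, the job shape) are illustrative assumptions, not RESTBase's actual code or configuration.

```typescript
// Hypothetical sketch of an update blacklist; names and structure are
// illustrative, not RESTBase's actual implementation.

// Per-domain set of titles whose automatic re-renders are skipped.
const updateBlacklist: Record<string, Set<string>> = {
    'it.wikipedia.org': new Set(['Utente:Biobot/log']),
    // Title decoded from the ur.wikipedia.org URL in the description.
    'ur.wikipedia.org': new Set(['نام_مقامات_کے']),
};

interface UpdateJob {
    domain: string;                      // e.g. 'it.wikipedia.org'
    title: string;                       // normalized page title
    triggeredBy: 'direct' | 'job-queue'; // what requested the render
}

// Skip re-renders triggered by the job queue for blacklisted pages;
// directly requested renders still go through.
function shouldRerender(job: UpdateJob): boolean {
    if (job.triggeredBy !== 'job-queue') {
        return true;
    }
    const titles = updateBlacklist[job.domain];
    return !(titles && titles.has(job.title));
}
```

In this sketch only job-queue-triggered re-renders are blocked; renders requested directly (for example by a reader or a VE session) would still be served, which fits pages that are effectively never edited in VE.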


Event Timeline

GWicke raised the priority of this task to High.
GWicke updated the task description.
GWicke added projects: RESTBase, Parsoid.
GWicke subscribed.
GWicke set Security to None.
GWicke updated the task description.

There was a rather significant storage p99 impact from deleting the HTML and data-parsoid for the two titles mentioned in the task description, and truncating https://it.wikipedia.org/wiki/Utente:Biobot/log on-wiki:

[Attached image: graph of the storage p99 impact]

HTML for those two titles used about 37 GB and 27 GB, respectively.

GWicke claimed this task.

This is now deployed.