Certain pages, like complex user sandboxes, galleries, or bot-generated data pages, are so huge or complex that Parsoid constantly fails to parse them. These pages can't even be opened in the browser, so nobody ever uses VisualEditor on them, and trying to reparse them is simply a waste of Parsoid resources and RESTBase storage. What makes the problem even harder is that these pages typically transclude hundreds of other pages/templates/images, so they are reparsed very frequently.
Up until now, our solution to the problem has been maintaining a static, manually curated blacklist of pages, where entries are added when they cause problems in RESTBase or Parsoid. This is a very labour-intensive approach, since every change to the blacklist needs a RESTBase deployment and thus takes quite some time. Also, when an extremely complex or frequently re-rendered page appears, it takes a lot of time to find and blacklist it, creating a real danger of a Parsoid outage.
Recently, a visualization was added to the ChangeProp logstash dashboard that creates a histogram of 'retry limit exceeded' logs with rerender status >= 500. After a day of data, pages with high failure counts on the histogram look like very good candidates for blacklisting, so the idea is to automate blacklisting.
Detecting blacklist candidates
A page should be blacklisted if its rate of fatal rendering errors is higher than some threshold X. Options for rate calculation:
- In memory, per node. A disadvantage is that it cannot detect the cross-rule rate (for example, transclusions + new revisions).
- An external store like Redis (which would be perfect). The load this would put on Redis is really low, less than one request per second; see the sketch after this list.
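To illustrate the Redis option, here is a minimal sketch of the rate calculation, assuming an ioredis client and a simple fixed-window counter. The key naming, the window size, and the value of X are placeholders, not decided values.

```typescript
import Redis from 'ioredis';

const redis = new Redis();               // assumed: a reachable Redis with default settings
const WINDOW_SECONDS = 24 * 60 * 60;     // count fatal errors per day (assumption)
const THRESHOLD_X = 10;                  // the X from the text; actual value TBD

/**
 * Record one fatal (>= 500) rerender failure for a title and report whether
 * the per-window failure count has crossed the blacklisting threshold.
 */
async function recordFatalError(domain: string, title: string): Promise<boolean> {
    const key = `cp:errcount:${domain}:${title}`;   // hypothetical key scheme
    const count = await redis.incr(key);
    if (count === 1) {
        // First failure in this window: start the window's TTL so the counter resets itself.
        await redis.expire(key, WINDOW_SECONDS);
    }
    return count >= THRESHOLD_X;
}
```

When the function returns true, the retry handler would add the page to the blacklist storage discussed in the next section.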
Storage for blacklisted pages
When a page is blacklisted, the entry has to be stored somewhere with a TTL of about 1-2 weeks, in case the page is later fixed.
We should also be able to list all the pages in the storage for debugging, and to manually delete entries if a page was blacklisted by mistake. Options for storage:
- Special table in RESTBase with private GET and PUT endpoints.
- A static field in the page_revisions table with a column-level TTL plus a private PUT endpoint. This is likely a no-go, since we wouldn't be able to list all the blacklisted pages, only check whether a particular page is blacklisted. Also, we wouldn't be able to make the blacklist available to other services. The advantage of this approach is that we fetch the page_revision anyway, so there are no additional requests to storage and no additional latency.
- Storage in ChangeProp. We could store this in Redis as well. We would have to hit Redis on all the 'level one' rerenders, like new revisions or transclusion updates; all the derived rerenders, like mobile apps rerenders or Varnish purges, can ignore the blacklist. Currently this would result in about 200 req/s. A sketch of this option follows the list.
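For the ChangeProp + Redis option, here is a minimal sketch of what the blacklist store could look like, again assuming an ioredis client. The key prefix, the two-week TTL, and the SCAN-based listing are illustrative choices only, not the agreed design.

```typescript
import Redis from 'ioredis';

const redis = new Redis();
const BLACKLIST_TTL = 14 * 24 * 60 * 60;     // ~2 weeks, per the TTL idea above
const PREFIX = 'cp:blacklist:';              // hypothetical key prefix

/** Blacklist a page; the entry expires automatically after BLACKLIST_TTL. */
async function blacklist(domain: string, title: string): Promise<void> {
    await redis.set(`${PREFIX}${domain}:${title}`, `${Date.now()}`, 'EX', BLACKLIST_TTL);
}

/** Checked on 'level one' rerenders (new revision, transclusion update). */
async function isBlacklisted(domain: string, title: string): Promise<boolean> {
    return (await redis.exists(`${PREFIX}${domain}:${title}`)) === 1;
}

/** List every blacklisted page, for debugging. Uses SCAN to avoid blocking Redis. */
async function listBlacklisted(): Promise<string[]> {
    const keys: string[] = [];
    let cursor = '0';
    do {
        const [next, batch] = await redis.scan(cursor, 'MATCH', `${PREFIX}*`, 'COUNT', 100);
        keys.push(...batch);
        cursor = next;
    } while (cursor !== '0');
    return keys.map((k) => k.slice(PREFIX.length));
}

/** Manually remove an entry that was blacklisted by mistake. */
async function unblacklist(domain: string, title: string): Promise<void> {
    await redis.del(`${PREFIX}${domain}:${title}`);
}
```

The main design point this illustrates is that TTL expiry, listing, and manual deletion all come for free with Redis, which is harder to get from the page_revisions column-TTL approach above.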
It seems like the best solution would involve adding a Redis store to ChangeProp, which means resolving T157089 first, so that task is being made a subtask of this one.