
Automate RESTBase blacklisting
Closed, Resolved · Public

Description

Certain pages, like complex user sandboxes, galleries, and bot-generated pages full of data, are so huge or complex that Parsoid constantly fails to parse them. These pages can't even be opened in the browser, so no one ever uses VisualEditor on them; trying to reparse them is simply a waste of Parsoid resources and RESTBase storage. What makes the problem even harder is that these pages normally transclude hundreds of other pages/templates/images, so they are reparsed very frequently.

Up until now our solution to the problem has been maintaining a static, manually generated blacklist of pages, to which entries are added when they cause problems in RESTBase or Parsoid. This is a very labour-intensive approach, since every change to the blacklist needs a RESTBase deployment and thus takes quite some time. Also, when an extremely complex or frequently re-rendered page appears, it takes a long time to find and blacklist it, creating a real danger of a Parsoid outage.

Recently, a special visualization was added to the ChangeProp Logstash dashboard that creates a histogram of retry-limit-exceeded logs with rerender status >= 500. After a day it seems that pages with high failure counts on the histogram are really good candidates for blacklisting. So the idea is to automate blacklisting.

Detecting blacklist candidates

The page should be blacklisted if the rate of fatal rendering errors for it is higher than X. Options for rate calculation:

  1. In memory, per node. A disadvantage is not detecting the cross-rule rate (for example with transclusions + new revisions).
  2. Some external store like Redis, which would be perfect (see the sketch below). The resource requirements for Redis are REALLY low, less than one request per second.
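For illustration only, a minimal sketch of option 2 as a fixed-window failure counter in Redis (written against the Node ioredis client, since ChangeProp is a Node service); the key naming, window length and threshold X are assumptions, not the actual implementation:

```
import Redis from 'ioredis';

// Placeholder connection details; in production this would go through the shared Redis setup.
const redis = new Redis({ host: 'localhost', port: 6379 });

const WINDOW_SECONDS = 24 * 3600; // assumed: count fatal rerenders per day
const FAILURE_THRESHOLD = 10;     // assumed value of "X"

// Called when a rerender has exhausted its retries with a status >= 500.
// Returns true once the page has crossed the threshold and should be blacklisted.
async function recordFatalRerender(domain: string, title: string): Promise<boolean> {
    const key = `cp:renderfail:${domain}:${title}`;
    const failures = await redis.incr(key);
    if (failures === 1) {
        // First failure in this window: start the expiry clock.
        await redis.expire(key, WINDOW_SECONDS);
    }
    return failures >= FAILURE_THRESHOLD;
}
```

Because the counter lives in Redis rather than in process memory, failures from different rules (transclusions, new revisions, retries) all land on the same key, which addresses the cross-rule concern from option 1.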

Storage for blacklisted pages

When a page is blacklisted, it has to be stored somewhere with a TTL of something like 1-2 weeks, in case the page is later fixed.
Also, we should be able to list all the pages in the storage for debugging, and to manually delete entries if a page was blacklisted by mistake. Options for storage:

  1. A special table in RESTBase with private GET and PUT endpoints.
  2. A static field in the page_revisions table with a column-level TTL + a private PUT endpoint. This is likely a no-go, since we wouldn't be able to list all the blacklisted pages, only check whether a particular page is blacklisted. Also, we wouldn't be able to blacklist the page for other services. The advantage of this approach is that we fetch the page_revision anyway, so no additional requests to storage and no additional latency.
  3. Storage in ChangeProp. We might store this in Redis as well. We would have to hit Redis on all the 'level one' rerenders, like new revisions or transclusion updates. All the derived rerenders, like mobile apps or Varnish purges, can ignore the blacklist. Currently this would result in about 200 req/s.

It seems like the best solution would involve adding a Redis store to ChangeProp, which means resolving T157089 first, so it has been made a subtask.
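For illustration, a sketch of what option 3 (a Redis-backed blacklist kept by ChangeProp) could look like, with a per-entry TTL, a way to list all entries for debugging, and manual removal; the key prefix, TTL and connection details are assumptions:

```
import Redis from 'ioredis';

// Placeholder connection details.
const redis = new Redis({ host: 'localhost', port: 6379 });

const PREFIX = 'cp:blacklist:';        // assumed key prefix
const BLACKLIST_TTL = 14 * 24 * 3600;  // 1-2 weeks, as suggested above

// Add a page; the entry expires on its own in case the page is later fixed.
async function blacklist(domain: string, title: string): Promise<void> {
    await redis.set(`${PREFIX}${domain}:${title}`, Date.now().toString(), 'EX', BLACKLIST_TTL);
}

// Checked on every 'level one' rerender (new revision, transclusion update).
async function isBlacklisted(domain: string, title: string): Promise<boolean> {
    return (await redis.exists(`${PREFIX}${domain}:${title}`)) === 1;
}

// List all blacklisted pages for debugging; SCAN is cheap since the list is small.
async function listBlacklisted(): Promise<string[]> {
    const keys: string[] = [];
    let cursor = '0';
    do {
        const [next, batch] = await redis.scan(cursor, 'MATCH', `${PREFIX}*`, 'COUNT', 100);
        cursor = next;
        keys.push(...batch);
    } while (cursor !== '0');
    return keys;
}

// Manual removal if a page was blacklisted by mistake.
async function unblacklist(domain: string, title: string): Promise<void> {
    await redis.del(`${PREFIX}${domain}:${title}`);
}
```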

Event Timeline

@Joe You mentioned that we would be able to use an existing Redis cluster for deduplication in ChangeProp, but it seems we've found a more pressing issue that would also benefit from adding such storage. To sum up the above, our uses for Redis are:

  1. Very low-volume rate limiting, with < 1 limiting candidate per second and very long limiting periods. The limit would be something above 10 per day.
  2. Maintaining the list of blacklisted resources with a TTL, with very low-frequency updates and ~200 checks per second of whether a resource is blacklisted or not. The list is very small, on the order of hundreds of domain+title pairs.

What would you recommend? Can we use an existing Redis for this as well? If yes, then which cluster? Or is it better to set up a dedicated Redis cluster? Potentially, we would expand the usage of Redis in ChangeProp with rate-limiting rerenders for all pages, which would put a more significant load on it.

It might help to think of this as a delay-tiered rate limiting setup, rather than a binary blacklist. Depending on the exact deduplication requirements on the job queue side, we might even be able to fold that into the same mechanism.

Update requirements are basically one per executed changeprop task.
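One possible reading of the delay-tiered idea, sketched with hypothetical tiers: instead of a yes/no blacklist, the number of recent failures selects how long the next rerender is deferred, so only the worst offenders end up effectively blacklisted.

```
// Hypothetical tiers: the more fatal failures a page has accumulated recently,
// the longer its next rerender is deferred.
const DELAY_TIERS_SECONDS = [0, 60, 3600, 24 * 3600]; // assumed values

function rerenderDelaySeconds(recentFailures: number): number {
    const tier = Math.min(recentFailures, DELAY_TIERS_SECONDS.length - 1);
    return DELAY_TIERS_SECONDS[tier];
}
```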

> Update requirements are basically one per executed changeprop task.

Not quite. We only have to update if we've reached the retry limit with response error code >= 500, which happens quite rarely, so the update rate would be VERY low. Also, the blacklist wouldn't be constructed for all the tasks, only for 'level 1' tasks like page_edit or transclusion_update. If the page is blacklisted for HTML, the subsequent events wouldn't be generated, so there is no need to track, for example, mobile updates.

> Update requirements are basically one per executed changeprop task.

> Not quite. We only have to update if we've reached the retry limit with response error code >= 500, which happens quite rarely, so the update rate would be VERY low.

This is assuming that we don't want to rate limit successful jobs. Those can still use significant resources. Many of the wide row issues were caused by pages that rendered just fine, but were re-rendered over & over.

> Also, the blacklist wouldn't be constructed for all the tasks, only for 'level 1' tasks like page_edit or transclusion_update. If the page is blacklisted for HTML, the subsequent events wouldn't be generated, so there is no need to track, for example, mobile updates.

From a robustness standpoint it wouldn't hurt to rate limit those as well. Sure, if update rates are a problem then we might need to batch, limit coverage, or apply other tricks.

What is the current rate of overall changeprop executions that would trigger updates if we rate limited everything?

The mechanism is more read-intensive: Redis will need to be consulted for each page_edit and transclusion_update (and their retry counterparts).

> This is assuming that we don't want to rate limit successful jobs. Those can still use significant resources. Many of the wide row issues were caused by pages that rendered just fine, but were re-rendered over & over.

Oh, this task is for blacklisting only. Rate limiting for successful jobs is a different story with different requirements; we haven't yet decided whether we want it and haven't yet designed its requirements, so I think we should concentrate only on blacklisting in this task.

> What is the current rate of overall changeprop executions that would trigger updates if we rate limited everything?

I would say roughly 500/s

> This is assuming that we don't want to rate limit successful jobs. Those can still use significant resources. Many of the wide row issues were caused by pages that rendered just fine, but were re-rendered over & over.

> Oh, this task is for blacklisting only. Rate limiting for successful jobs is a different story with different requirements; we haven't yet decided whether we want it and haven't yet designed its requirements, so I think we should concentrate only on blacklisting in this task.

I think the technical requirements are actually the same, since we aren't really looking for a binary blacklist anyway. Where do you see differences?

> I think the technical requirements are actually the same, since we aren't really looking for a binary blacklist anyway. Where do you see differences?

Ok, after more thinking I guess they are pretty similar indeed.

Change 355751 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] role::jobqueue_redis: add redis instances for changeprop

https://gerrit.wikimedia.org/r/355751

Change 355751 merged by Giuseppe Lavagetto:
[operations/puppet@production] role::jobqueue_redis: add redis instances for changeprop

https://gerrit.wikimedia.org/r/355751

Thank you @Joe for getting to this!

A couple of questions:

  1. To access it in Change-Prop I would just need to get the reds::shards::jobqueue::<%= site =>::changeprop-1 and 2 in CP config, right? No additional configuration needed?
  2. As I understand it, that doesn't provide an instance in the beta cluster, right? We might need one.

Change 356072 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] ChangeProp: Add Redis/Nutcracker connection info

https://gerrit.wikimedia.org/r/356072

Change 356073 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/change-propagation/deploy@master] Introduce the rate limiting route and config

https://gerrit.wikimedia.org/r/356073

> Thank you @Joe for getting to this!

> A couple of questions:

>   1. To access it in Change-Prop I would just need to get the reds::shards::jobqueue::<%= site =>::changeprop-1 and 2 in CP config, right? No additional configuration needed?

No, since we configured nutcracker, you can connect to a unix socket at /var/run/nutcracker/redis_$dc.sock and use the password you can find in puppet under $::passwords::redis::main_password.

>   1. As I understand it, that doesn't provide an instance in the beta cluster, right? We might need one.

Not at the moment, but it's easy to add one if we feel it's important there.
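For illustration, connecting from a Node service through that Nutcracker socket might look like the snippet below; the data-centre name and the way the password is injected are assumptions (the real value lives in puppet under $::passwords::redis::main_password):

```
import Redis from 'ioredis';

// Assumed: the puppet-managed password is passed in via the environment
// rather than hard-coded; 'eqiad' stands in for $dc.
const redis = new Redis({
    path: '/var/run/nutcracker/redis_eqiad.sock',
    password: process.env.REDIS_PASSWORD,
});

// Quick smoke test: PING should resolve to 'PONG' once connected.
redis.ping().then((reply) => console.log(reply));
```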

Change 356072 merged by Giuseppe Lavagetto:
[operations/puppet@production] ChangeProp: Add Redis/Nutcracker connection info

https://gerrit.wikimedia.org/r/356072

Change 356073 merged by Mobrovac:
[mediawiki/services/change-propagation/deploy@master] Introduce the rate limiting route and config

https://gerrit.wikimedia.org/r/356073

Mentioned in SAL (#wikimedia-operations) [2017-06-08T21:23:08Z] <ppchelko@tin> Started deploy [changeprop/deploy@56f7511]: Rate limiting code and config. T161710

Mentioned in SAL (#wikimedia-operations) [2017-06-08T21:24:54Z] <ppchelko@tin> Finished deploy [changeprop/deploy@56f7511]: Rate limiting code and config. T161710 (duration: 01m 46s)

Mentioned in SAL (#wikimedia-operations) [2017-06-21T20:09:51Z] <ppchelko@tin> Started deploy [changeprop/deploy@63e6a7b]: Actually start black-listing and rate-limiting articles. T161710

Mentioned in SAL (#wikimedia-operations) [2017-06-21T20:11:07Z] <ppchelko@tin> Finished deploy [changeprop/deploy@63e6a7b]: Actually start black-listing and rate-limiting articles. T161710 (duration: 01m 16s)

Blacklisting was deployed 2 weeks ago in logging-only mode. Data collected over those 2 weeks suggests that all the articles that were logged as blacklist candidates were legitimate, so it seems we have no false positives. The concrete rates and limits could be tweaked further, but today we've enabled actually blocking rerenders based on the blacklist.

This task is done; for any follow-up that might appear I'll open a new one.