RESTBase is enforcing access restrictions like revision deletions for all content it stores. To do this efficiently, it keeps information about blocks in a Cassandra table, which is then consulted for each each access to a title / revision. So far, it has stored this information as part of general per-revision information. However, this is less efficient than it could be, as most revisions actually don't have any restrictions placed on them. As a result, the table is a lot larger than it could be if we focused on just the essential restriction information.
https://github.com/wikimedia/restbase/pull/599 is introducing a table to track just restrictions (and redirects). The absence of a restriction in this new blacklist table will be treated as "access is okay", so we need to have a complete set of restrictions in this table, covering all revisions since the beginning of Wikipedia. While new changes will be updated dynamically (as with the revision table), we need to back-fill old data in a bulk import, before we can start relying on this information.
A query like this should get us the data we need: select page_title, rev_id, rev_deleted from revision join page on page_id = rev_page where rev_deleted != 0
As this will scan the entire revision table, efficient paging will be important to make this feasible. The query is probably not optimal for this; ideas for query optimizations would be much appreciated.
@jcrespo, would it be possible to run an import script against DB slave(s)? Are there other options for getting the restriction information efficiently?