Page MenuHomePhabricator

Import page restrictions to Cassandra restriction table
Closed, ResolvedPublic

Description

RESTBase is enforcing access restrictions like revision deletions for all content it stores. To do this efficiently, it keeps information about blocks in a Cassandra table, which is then consulted for each each access to a title / revision. So far, it has stored this information as part of general per-revision information. However, this is less efficient than it could be, as most revisions actually don't have any restrictions placed on them. As a result, the table is a lot larger than it could be if we focused on just the essential restriction information.

https://github.com/wikimedia/restbase/pull/599 is introducing a table to track just restrictions (and redirects). The absence of a restriction in this new blacklist table will be treated as "access is okay", so we need to have a complete set of restrictions in this table, covering all revisions since the beginning of Wikipedia. While new changes will be updated dynamically (as with the revision table), we need to back-fill old data in a bulk import, before we can start relying on this information.

A query like this should get us the data we need: select page_title, rev_id, rev_deleted from revision join page on page_id = rev_page where rev_deleted != 0

As this will scan the entire revision table, efficient paging will be important to make this feasible. The query is probably not optimal for this; ideas for query optimizations would be much appreciated.

@jcrespo, would it be possible to run an import script against DB slave(s)? Are there other options for getting the restriction information efficiently?

Details

Related Gerrit Patches:

Event Timeline

Restricted Application added subscribers: Zppix, Aklapper. · View Herald TranscriptMay 13 2016, 8:48 PM
  • Is this a one-time import and then other methods will be used for keeping it up to date or is it an import that will happen regularly? It seems one time, but it could have to be done several times if something fails/testing, etc?
  • How synchronous should it be (e.g. from the start to the end of the query on all wikis it could take some time, and it will need to query different servers, so data will not be consistent)? I assume it doesn't matter, but during the exporting process, it could overwrite an old state changed afterwards, depending on the data model. Your tracking changes should be syncronized with the export to avoid desync.

In general, except for maintenance/security issues, we do not run queries against the databases themselves, I suggest a mediawiki (maintenance?) script should be created that does the job, uses the right roles (probably vslow, that queries the appropriate servers), etc. The reason for that is if I tell you I reserve db10XX for this, by the time you run it it may have been decommissioned T134476 or temporarily down.

Vslow or dump is the right tole for this, and the only condition is that it would be done serialized and that it minimizes load on the server (I assume these will export a lot of data).

Is this a one-time import and then other methods will be used for keeping it up to date or is it an import that will happen regularly? It seems one time, but it could have to be done several times if something fails/testing, etc?

It's likely going to be a one-time import, but there is of course always a chance that something goes wrong.

How synchronous should it be (e.g. from the start to the end of the query on all wikis it could take some time, and it will need to query different servers, so data will not be consistent)?

We do not need a consistent snapshot, as we can rely on the idempotent data structure for back-filling. Ideally, we'd page through the large result set incrementally, 1000 result rows at a time or so. This way, we can also keep track of progress & restart part-way through.

I assume it doesn't matter, but during the exporting process, it could overwrite an old state changed afterwards, depending on the data model. Your tracking changes should be syncronized with the export to avoid desync.

The data structure is built around versioned restrictions & idempotent updates, so should be eventually consistent for the cases where updates overlap between ongoing updates & the back-fill process.

In general, except for maintenance/security issues, we do not run queries against the databases themselves, I suggest a mediawiki (maintenance?) script should be created that does the job, uses the right roles (probably vslow, that queries the appropriate servers), etc.

This script would need to know how to talk to cassandra as well. Perhaps we could have a MW maintenance script emit a JSON chunk on stdout, given a start offset. This would then be called periodically by a cassandra import script.

Re load, if we can find a version of the query that efficiently supports gradually paging through the result set one query at a time, then we can limit the load it places on the db by varying batch size & inter-batch delay.

Change 289337 had a related patch set uploaded (by GWicke):
WIP: exportRestrictions maintenance script

https://gerrit.wikimedia.org/r/289337

Only 1 concern- terbium is not a storage host, so it only has 20 GB free. If you have to store something on disk, please limit your chunks to 5-10GB with light (e.g. gz-1 thread) compression or less.

Assuming no consistency problems, you can batch by using the primary key (avoid ORDER BY-LIMIT). Not only it will be inconsistent, it is very inefficient. Reads should not create lag, but make sure you call "wait for lagged slaves" after every iteration, just in case.

select page_title, rev_id, rev_deleted from revision join page on page_id = rev_page where rev_deleted != 0 where page_id between X and Y

Revisions per page can vary wildly.

BTW, I do not know if mediawiki deployers will accept that, but if it doesn't fit on mediawiki repo, just perform the RC on operations/puppet.

GWicke added a comment.EditedMay 18 2016, 3:29 PM

@jcrespo, good point about ORDER BY. It seems that MySQL does not pick up that the order matches the primary key order. Since that's the default anyway, just dropping it should not change the results. With that done, I'm getting this explain output:

mysql> explain select page_title, rev_deleted from page join revision on page_id = rev_page where rev_id > 10 and rev_deleted != 0 limit 10;
+----+-------------+----------+-------+--------------------------------------------------------+-------------+---------+------------------------+------+------------------------------------+
| id | select_type | table    | type  | possible_keys                                          | key         | key_len | ref                    | rows | Extra                              |
+----+-------------+----------+-------+--------------------------------------------------------+-------------+---------+------------------------+------+------------------------------------+
|  1 | SIMPLE      | page     | index | PRIMARY                                                | name_title  | 261     | NULL                   |  503 | Using index                        |
|  1 | SIMPLE      | revision | ref   | PRIMARY,rev_page_id,page_timestamp,page_user_timestamp | rev_page_id | 4       | test_wiki.page.page_id |   13 | Using index condition; Using where |
+----+-------------+----------+-------+--------------------------------------------------------+-------------+---------+------------------------+------+------------------------------------+
2 rows in set (0.00 sec)

Do you expect paging by page_id to be significantly more efficient? I was going for rev_id to keep each batch's read set roughly similarly-sized.

Re terbium, we should be able to avoid storing anything by directly processing each batch in the calling script.

Reads should not create lag, but make sure you call "wait for lagged slaves" after every iteration, just in case.

I added a wfWaitForSlaves() call to delay each script execution.

I do not have a preference, I would expect batching on page and then joining to be faster, but just use whatever is faster. Have a look on my comment on the change about LIMIT (nondeterministic without ORDER BY). While in general you should get the same results, if we ended up not using InnoDB or the slave failovered in the middle of the operation, you want to make sure you get deterministic results.

I tested the current query on enwiki, and with a result batch size of 1000 each batch takes about 14 seconds. Query time grows proportional to the result size (~1.4s for 100), and is independent of the batch offset.

Considering how rare revision deletions are & how many rows need to be scanned to find results, I think this performance is reasonable.

Let me check my logs to confirm we cannot do better.

Sadly, there is no compound index on deleted.

If you are more interested on bound time rather than minimum amount of results, you can also do:

MariaDB PRODUCTION s1 localhost enwiki > SET @pivot := 1000000; FLUSH STATUS; pager cat > /dev/null; SELECT page_title, page_id, rev_id, rev_deleted FROM page JOIN revision on page_id = rev_page WHERE rev_id BETWEEN @pivot AND @pivot + 99999 AND rev_deleted != 0; nopager; SHOW STATUS like 'Hand%';
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

PAGER set to 'cat > /dev/null'
26 rows in set (0.04 sec)

PAGER set to stdout
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_external_lock      | 0     |
| Handler_icp_attempts       | 0     |
| Handler_icp_match          | 0     |
| Handler_mrr_init           | 0     |
| Handler_mrr_key_refills    | 0     |
| Handler_mrr_rowid_refills  | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 12    |
| Handler_read_last          | 0     |
| Handler_read_next          | 92436 |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_deleted   | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_tmp_update         | 0     |
| Handler_tmp_write          | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
25 rows in set (0.00 sec)

You will have to do more "passes", but you will not have to scan so many rows to get (some) results each time. Your choice.

@jcrespo The script at https://gerrit.wikimedia.org/r/#/c/289337/ is ready to be run. We'd need to run it one time over all wikis writing the results to files, and then scp or rsync the results out to some restbase host to run a separate import script. Do you have any preference on which MW host to run this script? What's the general process on running things like that?

I started a dump using this script on mw1017. This is creating a JSON file per project, with up to 100 restricted revisions per line / JSON object.

Please do not do it in random hosts. I will kill it now because it has been run without proper permission. Talk to me, or to anyone in Release Engineering, and they will tell you the right procedure.

I was yesterday off (holidays) please wait more than one day to do scary work.

jcrespo added a subscriber: greg.EditedSep 22 2016, 8:34 AM

You also told me under what it is now a false pretense, that the script would not generate any local data (only another script will be piped), so we should not worry about disk getting full (which I told you mediawiki servers where not very generous). That is not right.

On top of that, you did not follow the Lead of release engineering advice "[Engineering] Long running tasks/scripts now included on [[wikitech:Deployments]]", or at least I do not see it on the Deployments page. I asked for that policy to be enforced because long-running tasks collide with ongoing schema changes, of which 2 important are happening right now, so I will ask you to wait until they finish.

I hear your concerns, and am happy to wait until the migrations are done.

Let me also clarify some assumptions. I did check that disk space would not be an issue (total uncompressed JSON output is ~100-150mb, making piping unnecessary), and also used a fairly conservative page size of 100 results per query (meaning < 2s per query on enwiki), and no concurrency. The script waits for slave lag between queries. The intention and expectation was to cause significantly *less* load than many of the regular queries we do at higher concurrency via the API.

Regarding the policy, I did not interpret the instructions as applying to pure query scripts that run for less than 20 minutes each. https://wikitech.wikimedia.org/wiki/Deployments says:

While not strictly a deployment, performing long running (>1 hour) tasks (eg: migration scripts) can encounter issues when code is updated while a script is being run. For this reason it is required to add an entry in the calendar for the task with a window that accounts for the anticipated start time and estimated length for the task.

So, let me first suggest using terbium as the primary maintenance node. That will make sure that worst case scenario, only other maintenace and cron tasks are affected.

Second, make sure you have there enough available space.

Third, how long do you think it will take, can you run them everywhere except s1 (where the alter is currently running) before Sunday?

GWicke added a comment.EditedSep 22 2016, 3:02 PM

Third, how long do you think it will take, can you run them everywhere except s1 (where the alter is currently running) before Sunday?

Projects starting with a & b took about 40 minutes, enwiki about 20 minutes. Based on this, I expect the queries to take on the order of 5 hours for all projects combined.

Lets just wait until Monday, there is no need to rush individual projects manually.

Note that 9 hours later, the script was still running (!).

Please add it to the calendar to "book" the time before anyone else does. I will pause next schema changes until after it finishes.

Note that 9 hours later, the script was still running (!).

Looking at the timestamps, basically all that time was spent on commons. Was the alter running on s4 earlier?

GWicke added a comment.EditedSep 22 2016, 3:26 PM

If running this query counts as a deploy, should we also wait until after next week's deployment freeze?

Releng are the kings here, ask them. :-) I would be ok with it, as technically it is not a deployment? I do not know.

Previous jobs include s1 and s7.

greg added a comment.Sep 22 2016, 4:00 PM

Re timing of scripts (ie: the policy says anything over 1-ish hour): If you will be running multiple runs of the same script in fairly quick succession for technical reasons that should obviously be considered one instance and thus meets the inclusion criteria.

Re semantics: running a script doesn't count as a "deploy", but they are listed on the deploy calendar as that is the only single source of truth for what is happening on the production cluster in the future. :)

But, regarding deployment freezes (eg: next week's Ops offsite) and script running: Unless it's an emergency (obvious always there exception), we should post-pone. I don't want to put words into @jcrespo's mouth, but it seems (from reading this task commentary) that this is sufficiently high-touch to not happen while Ops are on very limited availability. Is that reasonable @jcrespo ?

@greg, yes, thinking it better, I won't be monitoring things, so it is the better option. It happens that I am easily convinced, so thanks for putting things right.

I see where you are coming from & generally share the "better safe than sorry" approach. We will wait another week.

That said, I feel that technically there was really not much of a reason to wait out next week. The risk / load created by running this particular sequence of select queries is rather low. If the same functionality was exposed through the API, I don't think we would have hesitated to make those requests at all.

Mentioned in SAL (#wikimedia-operations) [2016-10-04T00:04:34Z] <gwicke> Started run of exportRestrictions script on terbium (T135278); this is running in screen as user gwicke. It is not expected to generate noticeable load.

GWicke triaged this task as Medium priority.Oct 12 2016, 5:57 PM

Restrictions done, deletions are on the way.

Pchelolo added a comment.EditedOct 18 2016, 8:49 PM

All restrictions and deletions has be imported, now we need to switch to using them

Change 289337 abandoned by Ppchelko:
exportRestrictions maintenance script

Reason:
The script played it's role and we would never need it again, so abandon

https://gerrit.wikimedia.org/r/289337

Pchelolo closed this task as Resolved.Oct 18 2016, 9:28 PM
Pchelolo claimed this task.

The rest of the work will be tracked under T148592, resolving