
Consider dropping by_ns index on page_revisions tables
Closed, ResolvedPublic

Description

We currently define a by_ns index on the page_revisions table, which lists all revisions in a namespace alphabetically. On large wikis, this would map hundreds of millions of title / revision pairs into a single Cassandra partition.

We currently don't use this index at all, so I think we should consider simply dropping it. Use cases like alphabetically listing titles in a specific namespace can be tackled later in connection with T106455. We might want to consider some form of range index partitioning in the storage backends before we do so.
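
For context, a rough sketch of what such a secondary index declaration looks like in a restbase-mod-table schema (the attribute names and exact notation below are illustrative assumptions, not copied from the real page_revisions schema):

secondaryIndexes:
  by_ns:
    - attribute: namespace   # hash (partition) key of the index: one namespace = one partition
      type: hash
    - attribute: title       # range key: titles sorted alphabetically within that partition
      type: range
      order: asc

Because the whole namespace hashes to a single partition key, every title / revision pair in that namespace ends up in one Cassandra partition, which is the hotspot described above.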

Event Timeline

GWicke raised the priority of this task to Medium.
GWicke updated the task description.
GWicke added a project: RESTBase.

Support for secondary index dropping has been added in:
SQLite backend: https://github.com/wikimedia/restbase-mod-table-sqlite/pull/20
Cassandra backend: https://github.com/wikimedia/restbase-mod-table-cassandra/pull/146
Spec: https://github.com/wikimedia/restbase-mod-table-spec/pull/15
And a PR for the actual removal has been merged in RESTBase:
https://github.com/wikimedia/restbase/pull/330

So this ticket could be closed on the next deploy.

We've had to split this into 2 steps. The first step was to change the table metadata on all nodes, and it was done in the previous deploy. The next step is to actually drop the table; it will be done with a new set of PRs. The problem was that during a deploy not all RB nodes get the updated table metadata simultaneously, so secondary index updates would have failed on some nodes.

We have the same problem for delete-column and delete-secondary-index schema migrations: on deploy, not all nodes get notified about the schema change simultaneously, so some nodes could fail on an unknown column or table, or on a secondary index update (effectively making us lose valid writes while the deploy is going on).

There are several possible approaches to this issue:

  1. Only update the metadata on a secondary index, column (or even table) drop, and fix up the storage manually via cqlsh after the deploy is finished. This is the simplest, but ugliest, solution.
  2. Postpone the actual Cassandra schema update long enough to let all nodes pick up the metadata changes. This lets us deploy the complete schema change with a single deploy. I personally don't see any big "no-no" about this approach, given we select a large enough delay.
  3. Add a cleanup phase to the config. With this approach, deployment of a non-backwards-compatible change is broken into 2 phases: first, deploy the schema metadata change and let all nodes pick up the new metadata, ensuring that no node uses the deleted table/column any more; second, on the next deploy, add a cleanup section to the config that removes the needed tables, and remove that cleanup section after the deploy. This is a bit less ugly than a plain cqlsh fix-up, but still feels overcomplicated.

@GWicke, @mobrovac, @Eevans - any thoughts on which path seems more promising to you?

For this concrete case, I deleted those indexes manually a few hours after the deploy (option 1).

I'm not sure I understand option 2) fully. If that is about starting a timer for the delete, then I think that would be too risky.

Option 3) sounds sane to me. I agree that it's a bit ugly, but that could also be a good thing if it discourages us from making too many breaking schema changes. In any case, we should be really careful about anything that could break running code or lead to data loss, so something triggering in the background on the assumption that a deploy "ought to be done by then", without actually having any confirmation, seems like a no-go to me.

@GWicke Ok, let's go with option #3 then.

We never use dropTable in production, so the cleanup config section would only need to support delete column and drop secondary index.

The actual config element could have the following structure:

cleanup:
   - action: 'drop_column' | 'drop_secondary_index'
     table: 'table name without any modifications as it appears in the schema'
     item_name: 'column name to drop or a secondary index name, depending on the action'
   - action: 'other_action'
     table: 'other_table'
     item_name: 'other_item'

The cleanup section would appear on the top level of restbase-mod-table-cassandra configuration.
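
For this concrete task, a filled-in stanza might look like the following (a minimal sketch; the exact placement inside the module config and the quoting style are assumptions):

cleanup:
   - action: 'drop_secondary_index'
     table: 'page_revisions'
     item_name: 'by_ns'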

For option 3, do you mean that we should do two deploys: a first one removing the metadata, followed by a config change (with the added cleanup stanza)? If so, there would need to be three deploys then: the first two, followed by a third one that deploys the config without the new cleanup stanza, wouldn't it? Naturally, the clean-up actions would be no-ops if the tables / columns no longer exist.

All of that is to say that there is still manual work to be done on our part to achieve what we want. In that sense, I'd vote for option #1. I do find it ugly, but frankly more straightforward than option 3.

The out-of-sync problem is characteristic of table and column deletions. If one worker removes a column, another will receive an error if it tries to select or update it. How about identifying those errors, rereading the meta information from Cassandra and reconstructing the query accordingly? I'm reluctant, as this seems like a way to get into unforeseen edge cases, but it could be worked out as an option if column / table deletions start happening more often.

@mobrovac That was planned as side work over 3 consecutive deploys, so the main manual work is not forgetting to alter the configs. Indeed, that's extremely ugly, but I'm not sure we can reconstruct the queries in all cases. The source of the queries is RB itself (the client of restbase-mod-table-cassandra), so if we just dropped some column from the results, we could get completely unexpected results at higher levels.

I think we could postpone the decision until we actually need to drop something again (and hope we never do). We just need to keep this problem in mind.

@Pchelolo, @mobrovac: I think staying with the manual option 1) for now is very reasonable. In this concrete case, it took me perhaps five minutes to go through those indexes, so it isn't really *that* bad. There is some risk of making mistakes, but it's also possible to make bad mistakes with the config option.

+1

#2 sounds a bit janky to me; depending on timing in a distributed system is usually a bad idea (think network partitions). And, IMHO, there isn't a lot of difference between #1 and #3 (except for the extra steps #3 imposes).

OK, since we are staying with the manual option, I'm closing this ticket, as all the work is done and deployed.