Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap
Closed, Resolved · Public · 5 Estimated Story Points

Description

In order to ensure that every instance holds a local replica of the system_auth keyspace (so it does not have to ask other nodes every time it authenticates a request), we need to bump its replication factor.
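
For context, the bump itself is a single CQL statement followed by a repair of the keyspace on every instance; a sketch with the values that ended up being used (see the timeline below):

  ALTER KEYSPACE system_auth
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'};
  -- then, on every instance: nodetool repair system_auth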

Event Timeline

elukey triaged this task as High priority. Feb 6 2017, 4:47 PM

Preliminary report since we don't know the exact root cause:

I executed ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; as a follow-up step of the recent Cassandra cluster expansion for AQS (we added 6 more instances, for a total of 12).
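
(For completeness, the new settings can be confirmed from cqlsh; a sketch, with the expected output summarized as a comment:)

  DESCRIBE KEYSPACE system_auth;
  -- the output should include:
  -- replication = {'class': 'SimpleStrategy', 'replication_factor': '12'}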

Right after that, I started the nodetool-a repair system_auth command on aqs1004 and the 503s started. The main issue was that Hyperswitch/Restbase on most of the AQS nodes was not able to read data from Cassandra due to the absence of the aqs user (precise error: User aqs has no SELECT permission on <table local_group_default_T_pageviews_per_article_flat.data> or any of its parents).

I executed the command on all the nodes, as the procedure requires, hoping that the problem was a temporary inconsistency in the system_auth data that the repairs would resolve (each command took 3 to 4 minutes to complete).

Eventually I manually re-added the aqs user (and the aqsloader one, but that is not important for this explanation) and the issue quickly resolved (the user was replicated successfully to all 12 nodes).
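
For reference, a sketch of the kind of CQL used to restore the user, assuming role-based authentication (Cassandra 2.2+); the real password and the full set of grants are not reproduced here:

  CREATE ROLE IF NOT EXISTS aqs WITH PASSWORD = '<redacted>' AND LOGIN = true;
  GRANT SELECT ON KEYSPACE "local_group_default_T_pageviews_per_article_flat" TO aqs;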

One plausible explanation is that after bumping the replication factor to 12 we moved to a state in which 6 nodes had no system_auth data and 6 did. A nodetool repair might have had to guess between "aqs user yes, aqs user no", leading to a situation in which some of the nodes had the user and others didn't.
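
If that is what happened, the divergence should be visible by running the same query from cqlsh against each instance at a local consistency level and comparing the results; a sketch, assuming credentials live in system_auth.roles (Cassandra 2.2+):

  CONSISTENCY LOCAL_ONE;
  SELECT role FROM system_auth.roles;
  -- a missing 'aqs' row on some instances would confirm the roles were not
  -- uniformly replicated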

elukey edited projects, added Analytics-Kanban; removed Analytics.
elukey set the point value for this task to 5.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

> One plausible explanation is that after bumping the replication factor to 12 we moved to a state in which 6 nodes had no system_auth data and 6 did. A nodetool repair might have had to guess between "aqs user yes, aqs user no", leading to a situation in which some of the nodes had the user and others didn't.

That shouldn't happen; it would represent a very serious bug. The only way a repair should make data go away is if it brought over superseding tombstones from another node (which is to say, the data had been deleted somewhere).

> Preliminary report since we don't know the exact root cause:
>
> I executed ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; as a follow-up step of the recent Cassandra cluster expansion for AQS (we added 6 more instances, for a total of 12).
>
> Right after that, I started the nodetool-a repair system_auth command on aqs1004 and the 503s started. The main issue was that Hyperswitch/Restbase on most of the AQS nodes was not able to read data from Cassandra due to the absence of the aqs user (precise error: User aqs has no SELECT permission on <table local_group_default_T_pageviews_per_article_flat.data> or any of its parents).
>
> I executed the command on all the nodes, as the procedure requires, hoping that the problem was a temporary inconsistency in the system_auth data that the repairs would resolve (each command took 3 to 4 minutes to complete).

This is the documented best practice, but I think it's flawed. Most authorizations happen at LOCAL_ONE, and the moment the replication factor became 12, there were 6 instances (half of the cluster) that would be unable to satisfy the read. I think errors in the interim, until the repair completes, would be expected. Sorry I didn't flag this in earlier discussions.

However, it sounds like the problem persisted even after a cluster-wide repair of system_auth, which I can't explain.

> Eventually I manually re-added the aqs user (and the aqsloader one, but that is not important for this explanation) and the issue quickly resolved (the user was replicated successfully to all 12 nodes).

This sounds like something worth doing earlier in the process (at consistency level ALL, if possible).
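
One caveat: role-management statements may use their own internal consistency level, which may be what the "if possible" is alluding to. At minimum the result can be checked at ALL from cqlsh once the role has been re-created; a sketch:

  CONSISTENCY ALL;
  SELECT role, can_login FROM system_auth.roles WHERE role = 'aqs';
  -- requires a response from every replica, and read repair at this level also
  -- pushes the row to any replica still missing it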

Some have suggested that you can temporarily switch to AllowAllAuth{enticator,orizer}, bump the replication factor, repair, and then switch back. This seems like the safest process I've heard of so far.
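
For reference, that temporary switch would be a cassandra.yaml change rolled out with a restart across the cluster and reverted once the repair is done; a sketch with the stock Cassandra class names, not a procedure that has been run here:

  # cassandra.yaml -- temporary, while system_auth replication is being changed
  authenticator: AllowAllAuthenticator
  authorizer: AllowAllAuthorizer
  # after ALTER KEYSPACE + nodetool repair system_auth on every instance, revert to:
  # authenticator: PasswordAuthenticator
  # authorizer: CassandraAuthorizer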

Saved all the system-{a,b} logs from each host (/var/log/cassandra/system..) to /home/elukey/outage_logs/ for future investigation.

Some notes from the investigation so far:

  1. The procedure I used was reviewed with Eric, and we didn't find anything unusual that could have triggered the data loss.
  2. While it is possible that, right after setting system_auth's replication factor to 12, all the Cassandra instances started authenticating from their own token/data range (since at that point every instance could believe it already held all the system_auth data it was a replica for), we can't explain why running nodetool repair didn't fix the problem. On the contrary, the issue got worse, since the aqs (and aqsloader) users were deleted (we are not sure at this point from how many instances).
  3. We executed this procedure when the cluster was smaller without any major problem, but at that time we ran nodetool repair -pr system_auth (this time -pr wasn't used; see the sketch after this list). According to http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html this should not make a difference.
  4. We thought that the ALTER KEYSPACE command could have somehow "truncated" the stored roles, but there is no strong evidence for it (and if that were true, nodetool repair wouldn't have been the step doing a bad job).
  5. As far as I can see from logstash, all the AQS hosts were showing 500 errors.

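Regarding point 3 above, the two repair variants differ in scope; a sketch of both, using the multi-instance wrapper naming from these hosts:

  # repairs only the token ranges this instance owns as primary replica;
  # has to be run on every instance to cover the whole ring
  nodetool-a repair -pr system_auth

  # repairs all token ranges replicated on this instance (what was run this time);
  # with replication_factor 12 on 12 instances that should be every range
  nodetool-a repair system_auth
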
After a long investigation with Eric we didn't find a good root cause, only a lot of conjectures that are not supported by any log, data, or metric. I updated https://wikitech.wikimedia.org/wiki/Cassandra#Replicating_system_auth and https://wikitech.wikimedia.org/wiki/Incident_documentation/20170223-AQS, but I am afraid that more than this will not be possible.