Bump replication factor of system.auth table in cassandra when new nodes have finished bootstrap
Closed, Resolved · Public · 5 Estimated Story Points

Description

In order to ensure that every instance holds a local replica of the system_auth keyspace (so it does not have to ask other nodes every time it authenticates a request), we need to bump its replication factor.
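
For context, the bump itself is a single CQL statement followed by a repair of the keyspace on every instance; a sketch with the values that ended up being used (see the timeline below):

  ALTER KEYSPACE system_auth
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'};
  -- then, on every instance: nodetool repair system_auth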

Event Timeline

elukey triaged this task as High priority. Feb 6 2017, 4:47 PM

Preliminary report since we don't know the exact root cause:

I executed ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; as a follow-up step of the recent Cassandra cluster expansion for AQS (we added 6 more instances, for a total of 12).
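
(For completeness, the new settings can be confirmed from cqlsh; a sketch, with the expected output summarized as a comment:)

  DESCRIBE KEYSPACE system_auth;
  -- the output should include:
  -- replication = {'class': 'SimpleStrategy', 'replication_factor': '12'}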

Right after that, I started the nodetool-a repair system_auth command on aqs1004 and the 503s started. The main issue was that Hyperswitch/Restbase on most of the AQS nodes was not able to read data from Cassandra due to the absence of the aqs user (precise error: User aqs has no SELECT permission on <table local_group_default_T_pageviews_per_article_flat.data> or any of its parents).

I executed the command on all the nodes, as the procedure requires, hoping that the problem was a temporary inconsistency in the system_auth data that the repairs would resolve (each command took 3 to 4 minutes to complete).

Eventually I manually re-added the aqs user (and the aqsloader one, but that is not important for this explanation) and the issue quickly resolved (the user was replicated successfully to all 12 nodes).
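
For reference, a sketch of the kind of CQL used to restore the user, assuming role-based authentication (Cassandra 2.2+); the real password and the full set of grants are not reproduced here:

  CREATE ROLE IF NOT EXISTS aqs WITH PASSWORD = '<redacted>' AND LOGIN = true;
  GRANT SELECT ON KEYSPACE "local_group_default_T_pageviews_per_article_flat" TO aqs;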

One plausible explanation is that after bumping the replication factor to 12 we moved to a state in which 6 nodes had no system_auth data and 6 did. A nodetool repair might have had to guess between "aqs user yes, aqs user no", leading to a situation in which some of the nodes had the user and others didn't.
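
If that is what happened, the divergence should be visible by running the same query from cqlsh against each instance at a local consistency level and comparing the results; a sketch, assuming credentials live in system_auth.roles (Cassandra 2.2+):

  CONSISTENCY LOCAL_ONE;
  SELECT role FROM system_auth.roles;
  -- a missing 'aqs' row on some instances would confirm the roles were not
  -- uniformly replicated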

elukey edited projects, added Analytics-Kanban; removed Analytics.
elukey set the point value for this task to 5.
elukey moved this task from Next Up to In Progress on the Analytics-Kanban board.

> One plausible explanation is that after bumping the replication factor to 12 we moved to a state in which 6 nodes had no system_auth data and 6 did. A nodetool repair might have had to guess between "aqs user yes, aqs user no", leading to a situation in which some of the nodes had the user and others didn't.

That shouldn't happen; it would represent a very serious bug. The only way a repair should make data go away is if it brought over superseding tombstones from another node (which is to say, the data had been deleted somewhere).

> Preliminary report since we don't know the exact root cause:
>
> I executed ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'}; as a follow-up step of the recent Cassandra cluster expansion for AQS (we added 6 more instances, for a total of 12).
>
> Right after that, I started the nodetool-a repair system_auth command on aqs1004 and the 503s started. The main issue was that Hyperswitch/Restbase on most of the AQS nodes was not able to read data from Cassandra due to the absence of the aqs user (precise error: User aqs has no SELECT permission on <table local_group_default_T_pageviews_per_article_flat.data> or any of its parents).
>
> I executed the command on all the nodes, as the procedure requires, hoping that the problem was a temporary inconsistency in the system_auth data that the repairs would resolve (each command took 3 to 4 minutes to complete).

This is the documented best practice, but I think it's flawed. Most authorizations happen at LOCAL_ONE, and the moment the replication factor became 12, there were 6 instances (half of the cluster) that would be unable to satisfy the read. I think errors in the interim, until the repair completes, would be expected. Sorry I didn't flag this in earlier discussions.

However, it sounds like the problem persisted even after a cluster-wide repair of system_auth, which I can't explain.

> Eventually I manually re-added the aqs user (and the aqsloader one, but that is not important for this explanation) and the issue quickly resolved (the user was replicated successfully to all 12 nodes).

This sounds like something worth doing earlier in the process (at consistency level ALL, if possible).
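
One caveat: role-management statements may use their own internal consistency level, which may be what the "if possible" is alluding to. At minimum the result can be checked at ALL from cqlsh once the role has been re-created; a sketch:

  CONSISTENCY ALL;
  SELECT role, can_login FROM system_auth.roles WHERE role = 'aqs';
  -- requires a response from every replica, and read repair at this level also
  -- pushes the row to any replica still missing it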

Some have suggested that you can temporarily switch to AllowAllAuth{enticator,orizer}, bump the replication factor, repair, and then switch back. This seems like the safest process I've heard of so far.
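
For reference, that temporary switch would be a cassandra.yaml change rolled out with a restart across the cluster and reverted once the repair is done; a sketch with the stock Cassandra class names, not a procedure that has been run here:

  # cassandra.yaml -- temporary, while system_auth replication is being changed
  authenticator: AllowAllAuthenticator
  authorizer: AllowAllAuthorizer
  # after ALTER KEYSPACE + nodetool repair system_auth on every instance, revert to:
  # authenticator: PasswordAuthenticator
  # authorizer: CassandraAuthorizer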

Saved all the system-{a,b} logs from each host (/var/log/cassandra/system..) to /home/elukey/outage_logs/ for future investigation.

Some notes from the investigation so far:

  1. The procedure I used was reviewed with Eric, and we didn't find anything unusual that could have triggered the data loss.
  2. While it is possible that, right after setting system_auth's replication factor to 12, all the Cassandra instances started authenticating from their own token/data range (since at that point every instance could believe it already held all the system_auth data it was a replica for), we can't explain why running nodetool repair didn't fix the problem. On the contrary, the issue got worse, since the aqs (and aqsloader) users were deleted (we are not sure at this point from how many instances).
  3. We executed this procedure when the cluster was smaller without any major problem, but at that time we ran nodetool repair -pr system_auth (this time -pr wasn't used; see the sketch after this list). According to http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html this should not make a difference.
  4. We thought that the ALTER KEYSPACE command could have somehow "truncated" the stored roles, but there is no strong evidence for it (and if that were true, nodetool repair wouldn't have been the step doing a bad job).
  5. As far as I can see from logstash, all the AQS hosts were showing 500 errors.

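Regarding point 3 above, the two repair variants differ in scope; a sketch of both, using the multi-instance wrapper naming from these hosts:

  # repairs only the token ranges this instance owns as primary replica;
  # has to be run on every instance to cover the whole ring
  nodetool-a repair -pr system_auth

  # repairs all token ranges replicated on this instance (what was run this time);
  # with replication_factor 12 on 12 instances that should be every range
  nodetool-a repair system_auth
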
After a long investigation with Eric we didn't find a good root cause, only a lot of conjectures that are not supported by any log, data, or metric. I updated https://wikitech.wikimedia.org/wiki/Cassandra#Replicating_system_auth and https://wikitech.wikimedia.org/wiki/Incident_documentation/20170223-AQS, but I am afraid that more than this will not be possible.