
Ensure that system tables are sufficiently replicated on the aqs_next Cassandra cluster
Closed, Resolved · Public

Description

We experienced an incident with the aqs_next cluster before putting it into production.

When one of the 12 instances crashed, the AQS endpoint checks in Icinga failed for all hosts.

The hypothesis is that the system_auth keyspace is insufficiently replicated, and that the aqs user therefore could not authenticate against the other servers:

system_auth |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
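
For reference (this wasn't part of the original incident notes), the row above can be reproduced from cqlsh by querying the schema tables; on Cassandra 3.x the keyspace definitions live in system_schema:

cassandra@cqlsh> SELECT keyspace_name, durable_writes, replication FROM system_schema.keyspaces WHERE keyspace_name = 'system_auth';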

Event Timeline

From wikitech: https://wikitech.wikimedia.org/wiki/Cassandra#Replicating_system_auth

Authentication and authorization information is maintained in the system_auth keyspace, which by default uses SimpleStrategy and a replication factor of 1. This is definitely not what you want; A single node failure can prevent you from accessing your database! Best practice is to configure a replication factor of 3-5 per data-center.
Please check Incident documentation/20170223-AQS before proceeding, increasing the replication of system_auth on a live cluster may lead to outages.
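
As an illustration of that guidance (a sketch only, with 'eqiad' as a placeholder datacenter name; this is not a command that was run as part of this task), a per-datacenter replication factor would be set with NetworkTopologyStrategy rather than SimpleStrategy:

ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'eqiad': 3};

Any such change then needs to be followed by a full repair of system_auth on every instance, as covered below.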

There are some relevant tickets here:

During the incident the replication factor was switched from 6 to 12.

The existing AQS cluster is still using a replication factor of 12.

btullis@aqs1004:~$ sudo c-cqlsh a
cassandra@cqlsh> describe keyspace "system_auth";

CREATE KEYSPACE system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '12'}  AND durable_writes = true;

The new aqs cluster, by contrast, only has a replication factor of 1.

btullis@aqs1010:~$ sudo c-cqlsh a
cassandra@cqlsh> describe keyspace "system_auth";

CREATE KEYSPACE system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

I am tempted to stick with a replication factor of 12, to match the number of instances, but I will research whether there is any updated guidance on the best way to increase this value.

There is a note on the incident documentation which describes an outstanding actionable item:

Document best practices for increasing system_auth replication factor on Wikitech and its pitfalls.

I can't see any issues with the current process either, apart from the fact that we possibly didn't use the --full option to nodetool repair when repairing the tables.
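
For context, and as far as I understand nodetool's behaviour (rather than anything stated in the incident doc): from Cassandra 2.2 onwards a plain nodetool repair is incremental by default, so a full repair of the keyspace has to be requested explicitly, e.g.:

sudo nodetool-a repair system_auth          (incremental repair, the default on 2.2+)
sudo nodetool-a repair --full system_auth   (full repair of the keyspace)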

Given that the following two facts are true...

  • The aqs_next cluster is not serving any traffic
  • The aqs_next cluster is now using version 3.11 of Cassandra, instead of version 2.2

I am inclined to change the replication factor using the same method on this cluster and then see whether it causes any issues.

I will start with the command: ALTER KEYSPACE "system_auth" WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '12'};
...in the cqlsh shell on aqs1010 for instance a.

This completed successfully.
Now beginning to repair the system_auth keyspace.

btullis@aqs1010:~$ sudo nodetool-a repair --full system_auth
[2021-12-14 15:00:25,784] Starting repair command #131 (91ae3dd0-5cee-11ec-be31-5f83af3c046e), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
<snip .. snip>
[2021-12-14 15:00:41,947] Repair completed successfully
[2021-12-14 15:00:41,948] Repair command #131 finished in 16 seconds

Similarly, the repair for aqs1010-b completed successfully.

btullis@aqs1010:~$ sudo nodetool-b repair --full system_auth
[2021-12-14 15:02:24,649] Starting repair command #1 (d88724b0-5cee-11ec-9ef7-775684f5bccd), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
<snip .. snip>
[2021-12-14 15:02:38,597] Repair completed successfully
[2021-12-14 15:02:38,607] Repair command #1 finished in 13 seconds

I will do the remaining 10 repairs with cumin.
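
For the record, the sequential cumin run would have looked something like the following (the host expression and batching flag here are illustrative, not an exact command), followed by the same again with nodetool-b for the second instance on each host:

sudo cumin -b 1 'aqs101[1-5].eqiad.wmnet' 'nodetool-a repair --full system_auth'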

I ended up not doing the remaining 10 repairs with cumin, but manually.
We started getting 500 errors shortly after carrying out the second repair operation, whilst I was preparing a command for cumin to do the remaining work sequentially.
I then logged into the remaining hosts (aqs101[1-5]) and issued the repair commands.
I'll just copy them here for the record.

btullis@aqs1011:~$ sudo nodetool-a repair --full system_auth
[2021-12-14 15:08:26,080] Starting repair command #1 (aff57140-5cef-11ec-b0cb-cbea8a4f0028), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1011:~$ sudo nodetool-b repair --full system_auth
[2021-12-14 15:08:48,811] Starting repair command #1 (bd81c3e0-5cef-11ec-97b3-3d498f2a3af2), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1012:~$ sudo nodetool-a repair --full system_auth
[2021-12-14 15:09:15,848] Starting repair command #17 (cd9f96d0-5cef-11ec-91f7-9ffb188911af), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1012:~$ sudo nodetool-b repair --full system_auth
[2021-12-14 15:09:38,294] Starting repair command #1 (db006ca0-5cef-11ec-86c5-118cca4561dd), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1013:~$ sudo nodetool-a repair --full system_auth
[2021-12-14 15:11:54,291] Starting repair command #61 (2c1060a0-5cf0-11ec-a8e4-db40182d233f), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1013:~$ sudo nodetool-b repair --full system_auth
[2021-12-14 15:12:26,038] Starting repair command #1 (3efbd280-5cf0-11ec-b529-5b37bc9b36b4), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1014:~$ sudo nodetool-a repair --full system_auth
[2021-12-14 15:12:54,836] Starting repair command #1 (5025e550-5cf0-11ec-8e51-af866d54c81d), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1014:~$ sudo nodetool-b repair --full system_auth
[2021-12-14 15:13:20,269] Starting repair command #45 (5f4f6d30-5cf0-11ec-a4e3-21dbf600bdc5), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1015:~$ sudo nodetool-a repair --full system_auth
[2021-12-14 15:15:31,016] Starting repair command #59 (ad3e2400-5cf0-11ec-94b0-cb7d1b84120f), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)
btullis@aqs1015:~$ sudo nodetool-b repair --full system_auth
[2021-12-14 15:15:53,546] Starting repair command #45 (baab7cf0-5cf0-11ec-8b7f-1fc4a3b8ba32), repairing keyspace system_auth with repair options (parallelism: parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges: 3072, pull repair: false)

The 500 errors stopped shortly after the repair commands were issued on aqs1011, but there's still no definitive answer as to why they happened.

BTullis triaged this task as High priority. Dec 14 2021, 3:27 PM
BTullis moved this task from Next Up to In Progress on the Data-Engineering-Kanban board.

The 500 errors stopped shortly after the repair commands were issued on aqs1011, but there's still no definitive answer as to why they happened.

IIUC when the replication factor is increased for a keyspace, Cassandra probably tries to leverage this new config and apply some optimizations. For example, setting the replication factor to 12 should have the nice consequence that all instances in the cluster have a copy of the auth keyspace, and IIRC the auth check query for the aqs user is done using LOCAL_ONE. The main problem is that until an instance is repaired it doesn't hold the copy of the keyspace, so all the auth queries directed to it fail.
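
One way to see the replica placement directly (just a sketch of standard nodetool usage, not something that was run as part of this task) is getendpoints, which prints the instances owning the replicas for a given partition key; with system_auth at a replication factor of 12 it should list every instance in the cluster for the aqs role's row:

sudo nodetool-a getendpoints system_auth roles aqs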

Oh I see, I think. So queries other than the icinga query should still have worked?

(Although there weren't any queries because it was not in service.)

IIRC icinga/nagios should poke the local aqs daemon on every host, and aqs is the Cassandra client that needs to authenticate to Cassandra via user/pass. When any Cassandra instance receives the request with auth, it needs to verify the identity via the system_auth keyspace, which by default should use LOCAL_ONE as the consistency level (namely, any "local" - in the sense of the cluster - instance can satisfy the request; it doesn't need quorum etc.).

When only aqs1011-b held the system_auth keyspace, the LOCAL_ONE level meant that all Cassandra instances had to query aqs1011-b to verify the user. When you increased the replication factor to 12, the LOCAL_ONE "targets" increased to 12: basically all Cassandra instances knew that they had a copy of system_auth locally (not sure whether this is effectively understood and used by Cassandra as an optimization though), but without the nodetool repair only aqs1011-b was able to answer user auth queries correctly. The nodetool repair that you executed brought the system_auth keyspace copy onto all instances, clearing the alerts.

This is my understanding; I'm not totally sure that it's 100% correct, but overall it should make some sense (otherwise lemme know and we can dig more into it!).
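
One quick way to dig, if we want to (a sketch, assuming the Cassandra 3.x system_auth.roles table): connect to each instance with cqlsh and read the roles table at LOCAL_ONE, which is roughly what the auth path does for non-default users:

cassandra@cqlsh> CONSISTENCY LOCAL_ONE;
cassandra@cqlsh> SELECT role, can_login FROM system_auth.roles;

With the keyspace nominally at RF 12 but not yet repaired, a read like this can be served by a replica that doesn't actually hold the row yet, which would match the failure mode described above.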