Cassandra schema creation in RESTBase has always been somewhat flaky, but it seems to have gotten worse in the new cluster. Presumably this is the result of changes in the newer version of Cassandra (currently 3.11.0), our new application code, the differing conditions/load imposed by the new code on the cluster, or some combination thereof.
Queries to Cassandra upstream suggest that no one else is having this issue in the 3.11.x series, and while this doesn't preclude the possibility of a regression, it should be enough to warrant a thorough assessment of what can be done on our side before pursuing it further (we have had serious bugs in this code on more than one occasion).
One straightforward improvement would be to serialize schema changes on a single node. Currently, when RESTBase issues Cassandra schema changes, it uses a general purpose connection pool created at startup, one that is populated with all nodes in the cluster. This means it is possible that each schema altering statement could be executed on a different host, which is not recommended.
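The serialization described above can be sketched as follows. This is illustrative only, not RESTBase's actual code: `client` stands in for a driver connection pool that has been pinned (whitelisted) to a single host, and each DDL statement is awaited before the next is issued.

```javascript
// Sketch only: route all DDL through one pinned coordinator and execute
// statements strictly one at a time.  `client` is a stand-in for a
// connection pool whitelisted to a single host.
'use strict';

async function applySchemaSerially(client, statements) {
    for (const cql of statements) {
        // Awaiting each statement guarantees no two schema changes are
        // ever in flight at once, and all originate from the same host.
        await client.execute(cql);
    }
}

module.exports = { applySchemaSerially };
```

With the DataStax drivers, the pinning itself is typically achieved by constructing a separate client whose load-balancing policy whitelists a single contact point, rather than reusing the application's general purpose pool.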
Additionally, RESTBase aggressively retries on failure, and some of those failures may be timeouts while waiting for schema agreement. If the "failure" is simply that a schema mutation took longer to make its way around the cluster and reach agreement, then following it up with a retry of the same statement is a form of concurrency in its own right, and only likely to make matters worse.
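The alternative to a blind retry is to poll for schema agreement and only treat the statement as failed once the wait is exhausted. A minimal sketch, where `checkAgreement` is an injected async function (e.g. one comparing schema versions across `system.local` and `system.peers`; the name is an assumption, not a driver API):

```javascript
// Sketch: rather than retrying DDL on a timeout, poll until all nodes
// report the same schema version, up to a configurable deadline.
'use strict';

async function waitForSchemaAgreement(checkAgreement, maxWaitMs, intervalMs) {
    const deadline = Date.now() + maxWaitMs;
    while (Date.now() < deadline) {
        if (await checkAgreement()) {
            return true; // all nodes agree; safe to proceed
        }
        await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
    return false; // still disagreeing; only now is a failure real
}

module.exports = { waitForSchemaAgreement };
```

Only when this returns `false` is escalation (or a retry of the statement) actually warranted.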
Some prototype code was put together that uses a dedicated connection pool (via a WhiteListPolicy) to ensure that schema changes are serialized on a single host. A script was created that uses this pool to issue DDL statements defined in a YAML-formatted file. The script also allows the maximum schema agreement wait to be increased, and performs a separate validation of agreement before continuing with subsequent statements. This was tested by manually creating the tables needed for some recent migrations. The results, while still not perfect, are much improved over what we were seeing when RESTBase auto-generated the schema.
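For illustration, a DDL input file along these lines is what the script consumes. The field names and statements here are hypothetical; the prototype's actual format is not reproduced in this task.

```yaml
# Hypothetical example of the YAML-formatted DDL input; field names and
# statements are illustrative only.
maxSchemaAgreementWaitSeconds: 300
statements:
  - >
    CREATE TABLE IF NOT EXISTS "example_keyspace".data (
        key   text PRIMARY KEY,
        value blob
    )
```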
At a minimum, we should update RESTBase's schema creation and migration code to incorporate these techniques. However, perhaps it would be worth going a bit further:
Update: 2018-07-03
> Presumably this is the result of changes in the newer version of Cassandra (currently 3.11.0), our new application code, the differing conditions/load imposed by the new code on the cluster, or some combination thereof.
A number of factors conspire to make most of the hosts in the new cluster IO-bound: increased read/write amplification of the new strategy, aberrant performance of the Samsung SSDs (devices purchased outside of approved channels), and the move to JBOD, meant to limit the blast radius of a single failed device. It is these IO-bound hosts that limit the progression of schema modifications. This would seem to be the sole cause of any new instability between the previous cluster and this one.
> One straightforward improvement would be to serialize schema changes on a single node. Currently, when RESTBase issues Cassandra schema changes, it uses a general purpose connection pool created at startup, one that is populated with all nodes in the cluster. This means it is possible that each schema altering statement could be executed on a different host, which is not recommended.
> Additionally, RESTBase aggressively retries on failure, and some of those failures may be timeouts while waiting for schema agreement. If the "failure" is simply that a schema mutation took longer to make its way around the cluster and reach agreement, then following it up with a retry of the same statement is a form of concurrency in its own right, and only likely to make matters worse.
> Some prototype code was put together that uses a dedicated connection pool (via a WhiteListPolicy) to ensure that schema changes are serialized on a single host. A script was created that uses this pool to issue DDL statements defined in a YAML-formatted file. The script also allows the maximum schema agreement wait to be increased, and performs a separate validation of agreement before continuing with subsequent statements. This was tested by manually creating the tables needed for some recent migrations. The results, while still not perfect, are much improved over what we were seeing when RESTBase auto-generated the schema.
Serializing each migration, issuing it from a single node, and robustly guaranteeing agreement before proceeding improves the situation immensely (it makes the process reliable, if slow), but the only correct solution is to eliminate the IO-bounding.
Proposal
Decouple schema creation and migration from RESTBase startup; use an (idempotent) schema upgrade script, run prior to code and/or config deploys (similar to MediaWiki, for example).
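As a sketch of what "idempotent" means here: every statement in the upgrade script is guarded (`IF NOT EXISTS` / `IF EXISTS`) so that re-running the whole script before any deploy is harmless. The statements below are illustrative, not RESTBase's actual schema, and `client` again stands in for a single-host connection pool.

```javascript
// Hypothetical sketch of a decoupled, idempotent upgrade script.  Every
// statement is guarded, so re-issuing the script is a no-op.
'use strict';

const upgradeStatements = [
    "CREATE KEYSPACE IF NOT EXISTS example WITH replication = " +
        "{ 'class': 'SimpleStrategy', 'replication_factor': 1 }",
    'CREATE TABLE IF NOT EXISTS example.data (key text PRIMARY KEY, value blob)'
];

// Run every guarded statement in order; returns the number issued.
async function runUpgrade(client) {
    for (const cql of upgradeStatements) {
        await client.execute(cql);
    }
    return upgradeStatements.length;
}

module.exports = { runUpgrade };
```

Because the script is safe to repeat, it can be wired into deploy tooling unconditionally, with failures surfacing long before the application restarts.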
Rationale
RESTBase's attempts at making schema creation and migration fully automatic have been a recurring source of issues, despite the non-trivial resources spent to make them work. That it works this way at all is an attempt to optimize the installation and upgrade process by abstracting Cassandra specifics away from third-party users. However, it seems unlikely that third parties with a use for RESTBase (at a scale that justifies Cassandra) exist in sufficient numbers to warrant the effort, that they lack the sophistication for a more manual process, or that schema management alone is enough to usefully abstract Cassandra from users.
Decoupling the process allows for a simpler, easier-to-maintain code base. Database changes can occur in the background, without disruption to the running application, and failures can be caught and dealt with long before a deploy.





