RESTBase: Cassandra 2.2.6 post-upgrade checklist
Closed, Resolved · Public

Description

This is a placeholder for a number of minor, miscellaneous post-upgrade items.

  • Clear system keyspace snapshots
  • Drop legacy system_auth tables in Staging (users, credentials, and permissions)
  • Drop legacy system_auth tables in Production (users, credentials, and permissions)
  • Update APT repository with 2.2.6 packages (T140409: Update Cassandra in Wikimedia APT repository)
  • Clean up commitlog backups (restbase1010.eqiad.wmnet:~eevans/commitlog)
  • Puppet: move per-host cassandra::target_version assignments to cluster-wide setting (r/298631)

Event Timeline

Eevans created this task. Jul 7 2016, 6:18 PM
Restricted Application removed a project: Patch-For-Review. Jul 7 2016, 6:18 PM
Restricted Application added a subscriber: Zppix.
Eevans moved this task from Backlog to Blocked on the Cassandra board. Jul 7 2016, 6:18 PM
Eevans moved this task from Blocked to In-Progress on the Cassandra board. Jul 9 2016, 12:22 AM

Change 298631 had a related patch set uploaded (by Eevans):
Move node-specific versions to a cluster-wide setting

https://gerrit.wikimedia.org/r/298631

Eevans updated the task description. Jul 12 2016, 9:26 PM
Eevans updated the task description. Jul 12 2016, 9:38 PM
Eevans updated the task description. Jul 13 2016, 2:03 PM

Mentioned in SAL [2016-07-13T14:09:24Z] <urandom> Dropping legacy system_auth tables in staging to complete RBAC conversion : T139639

Mentioned in SAL [2016-07-13T14:32:35Z] <urandom> Restarting RESTBase on xenon.eqiad.wmnet : T139639

Yurik removed a subscriber: Yurik. Jul 13 2016, 2:35 PM

Mentioned in SAL [2016-07-13T14:38:20Z] <urandom> Restarting Cassandra on xenon.eqiad.wmnet : T139639

Mentioned in SAL [2016-07-13T14:44:13Z] <urandom> Starting offset dump runs from {xenon,cerium,praseodymium}.eqiad.wmnet : T139639

Eevans updated the task description. Jul 13 2016, 2:45 PM

Mentioned in SAL [2016-07-13T15:11:37Z] <urandom> Stopping Staging dumps : T139639

Eevans updated the task description. Jul 13 2016, 3:12 PM

Testing of the RBAC conversion in Staging is complete.

The legacy tables were dropped:

DROP TABLE users;
DROP TABLE credentials;
DROP TABLE permissions;

Which prompts MigrationManager to complete the transition:

INFO [SharedPool-Worker-1] 2016-07-13 14:09:32,400 MigrationManager.java:401 - Drop table 'system_auth/users'
DEBUG [MigrationStage:1] 2016-07-13 14:09:36,160 MigrationManager.java:493 - Gossiping my schema version e2d3e5f2-1809-3482-9a2d-28bcf091e970
DEBUG [GossipStage:1] 2016-07-13 14:09:38,023 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:09:38,374 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:09:38,741 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:09:39,226 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:09:39,539 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:09:39,539 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:09:39,539 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:09:39,540 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
INFO [SharedPool-Worker-1] 2016-07-13 14:10:14,411 MigrationManager.java:401 - Drop table 'system_auth/credentials'
DEBUG [MigrationStage:1] 2016-07-13 14:10:14,834 MigrationManager.java:493 - Gossiping my schema version d5a202d6-9a4b-3e18-870f-a29062f26f14
DEBUG [GossipStage:1] 2016-07-13 14:10:15,381 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:15,509 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:15,509 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:15,509 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:15,509 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:15,510 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:15,982 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:16,545 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
INFO [SharedPool-Worker-1] 2016-07-13 14:10:20,269 MigrationManager.java:401 - Drop table 'system_auth/permissions'
DEBUG [MigrationStage:1] 2016-07-13 14:10:20,687 MigrationManager.java:493 - Gossiping my schema version 7adc8cc1-8d28-35b5-8847-8f85472c7f8c
DEBUG [GossipStage:1] 2016-07-13 14:10:21,384 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:21,384 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:21,385 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:21,385 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:21,385 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:21,386 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [GossipStage:1] 2016-07-13 14:10:21,386 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false
DEBUG [OptionalTasks:1] 2016-07-13 14:11:20,645 MigrationManager.java:125 - not submitting migration task for /10.192.16.157 because our versions match

I tested RESTBase connectivity to Cassandra immediately after the DROPs (using curl and htmldumper), after restarting RESTBase on a node, and after restarting Cassandra on a node.

Looks Good To Me.
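
For anyone repeating this check, the MigrationManager excerpt above can be condensed into one line per dropped table. The following is a minimal sketch, assuming the log line wording shown in this task; the function name summarize_drops is made up for illustration and is not part of Cassandra or any tooling used here.

```python
import re

# Patterns based on the MigrationManager log lines quoted in this task.
DROP_RE = re.compile(r"Drop table '(?P<keyspace>[^/']+)/(?P<table>[^']+)'")
GOSSIP_RE = re.compile(r"Gossiping my schema version (?P<version>[0-9a-f-]{36})")


def summarize_drops(log_lines):
    """Pair each 'Drop table' entry with the next gossiped schema version."""
    summary = []
    pending = None  # table whose drop has been logged but not yet gossiped
    for line in log_lines:
        drop = DROP_RE.search(line)
        if drop:
            pending = f"{drop['keyspace']}.{drop['table']}"
            continue
        gossip = GOSSIP_RE.search(line)
        if gossip and pending is not None:
            summary.append((pending, gossip["version"]))
            pending = None
    return summary


if __name__ == "__main__":
    sample = [
        "INFO [SharedPool-Worker-1] 2016-07-13 14:09:32,400 MigrationManager.java:401 - Drop table 'system_auth/users'",
        "DEBUG [MigrationStage:1] 2016-07-13 14:09:36,160 MigrationManager.java:493 - Gossiping my schema version e2d3e5f2-1809-3482-9a2d-28bcf091e970",
    ]
    for table, version in summarize_drops(sample):
        print(table, version)
```

Each DROP should produce exactly one new schema version; a drop with no following gossip line would simply be missing from the summary.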

Change 298631 merged by Elukey:
Move node-specific versions to a cluster-wide setting

https://gerrit.wikimedia.org/r/298631

Eevans updated the task description. Jul 14 2016, 2:31 PM

Mentioned in SAL [2016-07-14T19:00:26Z] <urandom> Dropping legacy Cassandra system_auth tables in RESTBase production to complete RBAC conversion : T139639

The legacy tables have been dropped, but not entirely without incident; I encountered a schema mismatch exception during a DROP.

cassandra@cqlsh:system_auth> DROP TABLE users;
cassandra@cqlsh:system_auth> DROP TABLE credentials;
cassandra@cqlsh:system_auth> DROP TABLE permissions;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
OperationTimedOut: errors={}, last_host=10.64.48.120
cassandra@cqlsh:system_auth> DROP TABLE permissions;
InvalidRequest: code=2200 [Invalid query] message="unconfigured table credentials"
cassandra@cqlsh:system_auth>

This mismatch must have been transient, because when I checked immediately afterward, the schema versions agreed.

$ sudo ~eevans/c-commands/c-cqlsh a -e 'select schema_version from system.local;'
restbase1007.eqiad.wmnet:
restbase1007.eqiad.wmnet: schema_version
restbase1007.eqiad.wmnet: --------------------------------------
restbase1007.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1007.eqiad.wmnet:
restbase1007.eqiad.wmnet: (1 rows)
restbase1010.eqiad.wmnet:
restbase1010.eqiad.wmnet: schema_version
restbase1010.eqiad.wmnet: --------------------------------------
restbase1010.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1010.eqiad.wmnet:
restbase1010.eqiad.wmnet: (1 rows)
restbase1011.eqiad.wmnet:
restbase1011.eqiad.wmnet: schema_version
restbase1011.eqiad.wmnet: --------------------------------------
restbase1011.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1011.eqiad.wmnet:
restbase1011.eqiad.wmnet: (1 rows)
restbase1008.eqiad.wmnet:
restbase1008.eqiad.wmnet: schema_version
restbase1008.eqiad.wmnet: --------------------------------------
restbase1008.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1008.eqiad.wmnet:
restbase1008.eqiad.wmnet: (1 rows)
restbase1012.eqiad.wmnet:
restbase1012.eqiad.wmnet: schema_version
restbase1012.eqiad.wmnet: --------------------------------------
restbase1012.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1012.eqiad.wmnet:
restbase1012.eqiad.wmnet: (1 rows)
restbase1013.eqiad.wmnet:
restbase1013.eqiad.wmnet: schema_version
restbase1013.eqiad.wmnet: --------------------------------------
restbase1013.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1013.eqiad.wmnet:
restbase1013.eqiad.wmnet: (1 rows)
restbase1009.eqiad.wmnet:
restbase1009.eqiad.wmnet: schema_version
restbase1009.eqiad.wmnet: --------------------------------------
restbase1009.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1009.eqiad.wmnet:
restbase1009.eqiad.wmnet: (1 rows)
restbase1014.eqiad.wmnet:
restbase1014.eqiad.wmnet: schema_version
restbase1014.eqiad.wmnet: --------------------------------------
restbase1014.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1014.eqiad.wmnet:
restbase1014.eqiad.wmnet: (1 rows)
restbase1015.eqiad.wmnet:
restbase1015.eqiad.wmnet: schema_version
restbase1015.eqiad.wmnet: --------------------------------------
restbase1015.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase1015.eqiad.wmnet:
restbase1015.eqiad.wmnet: (1 rows)
restbase2003.codfw.wmnet:
restbase2003.codfw.wmnet: schema_version
restbase2003.codfw.wmnet: --------------------------------------
restbase2003.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2003.codfw.wmnet:
restbase2003.codfw.wmnet: (1 rows)
restbase2004.codfw.wmnet:
restbase2004.codfw.wmnet: schema_version
restbase2004.codfw.wmnet: --------------------------------------
restbase2004.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2004.codfw.wmnet:
restbase2004.codfw.wmnet: (1 rows)
restbase2008.codfw.wmnet:
restbase2008.codfw.wmnet: schema_version
restbase2008.codfw.wmnet: --------------------------------------
restbase2008.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2008.codfw.wmnet:
restbase2008.codfw.wmnet: (1 rows)
restbase2001.codfw.wmnet:
restbase2001.codfw.wmnet: schema_version
restbase2001.codfw.wmnet: --------------------------------------
restbase2001.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2001.codfw.wmnet:
restbase2001.codfw.wmnet: (1 rows)
restbase2002.codfw.wmnet:
restbase2002.codfw.wmnet: schema_version
restbase2002.codfw.wmnet: --------------------------------------
restbase2002.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2002.codfw.wmnet:
restbase2002.codfw.wmnet: (1 rows)
restbase2007.codfw.wmnet:
restbase2007.codfw.wmnet: schema_version
restbase2007.codfw.wmnet: --------------------------------------
restbase2007.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2007.codfw.wmnet:
restbase2007.codfw.wmnet: (1 rows)
restbase2005.codfw.wmnet:
restbase2005.codfw.wmnet: schema_version
restbase2005.codfw.wmnet: --------------------------------------
restbase2005.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2005.codfw.wmnet:
restbase2005.codfw.wmnet: (1 rows)
restbase2006.codfw.wmnet:
restbase2006.codfw.wmnet: schema_version
restbase2006.codfw.wmnet: --------------------------------------
restbase2006.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2006.codfw.wmnet:
restbase2006.codfw.wmnet: (1 rows)
restbase2009.codfw.wmnet:
restbase2009.codfw.wmnet: schema_version
restbase2009.codfw.wmnet: --------------------------------------
restbase2009.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878
restbase2009.codfw.wmnet:
restbase2009.codfw.wmnet: (1 rows)
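
Comparing that many per-host results by eye is error-prone, so the check can be automated. Below is a minimal sketch, assuming output in the "host: value" form shown above; the function names schema_versions_by_host and hosts_by_version are invented for this example and are not part of the c-cqlsh wrapper.

```python
import re
from collections import defaultdict

# A schema_version value is a UUID, e.g. 50932db8-42f1-3900-a823-04b6dc63c878.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
)


def schema_versions_by_host(output):
    """Map each host to the schema_version UUID found in its prefixed output."""
    versions = {}
    for line in output.splitlines():
        # Split on the first colon only; the host prefix contains none.
        host, sep, value = line.partition(":")
        value = value.strip()
        if sep and UUID_RE.match(value):
            versions[host.strip()] = value
    return versions


def hosts_by_version(versions):
    """Group hosts by schema version; a single group means agreement."""
    groups = defaultdict(list)
    for host, version in versions.items():
        groups[version].append(host)
    return dict(groups)


if __name__ == "__main__":
    sample = (
        "restbase1007.eqiad.wmnet: schema_version\n"
        "restbase1007.eqiad.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878\n"
        "restbase2009.codfw.wmnet: 50932db8-42f1-3900-a823-04b6dc63c878\n"
    )
    groups = hosts_by_version(schema_versions_by_host(sample))
    print("agreed" if len(groups) == 1 else f"mismatch: {groups}")
```

Grouping by version rather than returning a boolean makes it easy to see which hosts lag behind when the versions do differ.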

It wasn't nothing, though: Cassandra logged many exceptions for a minute or so immediately after the DROPs.

The logged exceptions aren't terribly surprising; they look like this:

	com.google.common.util.concurrent.UncheckedExecutionException: java.lang.IllegalArgumentException: Unknown keyspace/cf pair (system_auth.users)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2201) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache.get(LocalCache.java:3934) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3938) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4821) ~[guava-16.0.jar:na]
	at org.apache.cassandra.auth.PermissionsCache.getPermissions(PermissionsCache.java:72) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.AuthenticatedUser.getPermissions(AuthenticatedUser.java:104) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.service.ClientState.authorize(ClientState.java:367) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.service.ClientState.checkPermissionOnResourceChain(ClientState.java:300) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.service.ClientState.ensureHasPermission(ClientState.java:277) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.service.ClientState.hasAccess(ClientState.java:264) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.service.ClientState.hasColumnFamilyAccess(ClientState.java:248) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess(ModificationStatement.java:162) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:223) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:466) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:443) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:142) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:507) [apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:401) [apache-cassandra-2.2.6.jar:2.2.6]
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) [netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333) [netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.channel.AbstractChannelHandlerContext.access$700(AbstractChannelHandlerContext.java:32) [netty-all-4.0.23.Final.jar:4.0.23.Final]
	at io.netty.channel.AbstractChannelHandlerContext$8.run(AbstractChannelHandlerContext.java:324) [netty-all-4.0.23.Final.jar:4.0.23.Final]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_91]
	at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) [apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.2.6.jar:2.2.6]
	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_91]
Caused by: java.lang.IllegalArgumentException: Unknown keyspace/cf pair (system_auth.users)
	at org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:169) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.service.StorageProxy.readRegular(StorageProxy.java:1383) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:1275) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:220) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:176) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.CassandraRoleManager.getRoleFromTable(CassandraRoleManager.java:504) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.CassandraRoleManager.getRole(CassandraRoleManager.java:491) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.CassandraRoleManager.isSuper(CassandraRoleManager.java:301) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.Roles.hasSuperuserStatus(Roles.java:52) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.AuthenticatedUser.isSuper(AuthenticatedUser.java:71) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.CassandraAuthorizer.authorize(CassandraAuthorizer.java:76) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.PermissionsCache$1.load(PermissionsCache.java:124) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at org.apache.cassandra.auth.PermissionsCache$1.load(PermissionsCache.java:121) ~[apache-cassandra-2.2.6.jar:2.2.6]
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3524) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2317) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2280) ~[guava-16.0.jar:na]
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2195) ~[guava-16.0.jar:na]
	... 25 common frames omitted

I think everything is OK at this time, but I will continue to monitor.
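
To put a number on "many" while monitoring, the "Caused by:" lines in the Cassandra log can be tallied. This is a rough sketch, assuming stack traces formatted like the one above; count_causes is a made-up helper, and it counts every "Caused by:" line rather than only the innermost cause of each trace.

```python
import re
from collections import Counter

# Matches e.g.:
# Caused by: java.lang.IllegalArgumentException: Unknown keyspace/cf pair (system_auth.users)
CAUSE_RE = re.compile(r"^Caused by: (?P<cls>[\w.$]+): (?P<msg>.*)$")


def count_causes(log_text):
    """Count 'Caused by:' lines, keyed by (exception class, message)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CAUSE_RE.match(line.strip())
        if match:
            counts[(match["cls"], match["msg"])] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Caused by: java.lang.IllegalArgumentException: "
        "Unknown keyspace/cf pair (system_auth.users)\n"
        "\tat org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:169)\n"
    )
    for (cls, msg), n in count_causes(sample).most_common():
        print(n, cls, msg)
```

A count that stops growing after the first minute or so would be consistent with the exceptions being a transient effect of the DROPs.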

Eevans updated the task description. Jul 14 2016, 7:35 PM
Eevans moved this task from In-Progress to Blocked on the Cassandra board. Jul 14 2016, 7:44 PM
Eevans updated the task description.

@Eevans, if there wasn't much time between those two executions (< 30 seconds), then it is likely that schema agreement wasn't reached yet. The time to agreement has grown with the cluster size, and was a major issue when we first started to auto-create tables.

This is probably what it boils down to, but I did bake in a delay (and one of more than 30 seconds).
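
Since time to agreement grows with cluster size, a fixed delay between schema changes can be replaced by polling until the nodes agree. The following is a minimal sketch under stated assumptions: fetch_versions is a hypothetical callable that returns a {host: schema_version} mapping (for example, built from system.local queries against each node), and the timeout and interval defaults are arbitrary.

```python
import time


def wait_for_schema_agreement(fetch_versions, timeout=60.0, interval=2.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Poll until every node reports the same schema version, or time out.

    fetch_versions is a hypothetical callable returning {host: version}.
    Returns the agreed version, or None if the timeout elapsed first.
    """
    deadline = clock() + timeout
    while True:
        versions = fetch_versions()
        distinct = set(versions.values())
        if versions and len(distinct) == 1:
            return distinct.pop()
        if clock() >= deadline:
            return None
        sleep(interval)


if __name__ == "__main__":
    # Simulated cluster: one node lags on the first poll, then catches up.
    polls = iter([
        {"node-a": "v1", "node-b": "v2"},
        {"node-a": "v2", "node-b": "v2"},
    ])
    agreed = wait_for_schema_agreement(lambda: next(polls), interval=0.1)
    print("agreed on", agreed)
```

Running each DROP only after the previous one has reached agreement avoids guessing at a delay, and the None return gives the caller a clear point at which to stop and investigate.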

Eevans updated the task description. Aug 15 2016, 7:47 PM
Eevans renamed this task from Cassandra 2.2.6 post-upgrade checklist to RESTBase: Cassandra 2.2.6 post-upgrade checklist. Oct 4 2016, 9:04 PM
Eevans closed this task as Resolved.
Eevans updated the task description.

With the completion of T140409, this is now resolved.