
Setup staging for testing RESTBase deploys
Closed, ResolvedPublic

Description

  • Size of staging setup
    • More than 1
    • Big enough to test for next steps
    • 10–12?
  • Cassandra instance per restbase install
  • config example in RESTBase repo w/public wikipedia and parsoid
  • Keys in ops/private also in config example
    • Cassandra user/pass
    • salt
  • metrics/graphite jar file via trebuchet...maybe in labs, definitely in production

Event Timeline

thcipriani raised the priority of this task from to Needs Triage.
thcipriani updated the task description.
thcipriani added a project: Deployments.
Restricted Application added a subscriber: Aklapper. Jun 30 2015, 12:48 AM
Jdforrester-WMF set Security to None.
GWicke added a subscriber: GWicke. Edited · Jun 30 2015, 12:57 AM

Our current staging setup uses three physical nodes (xenon, cerium, praseodymium) in prod. It has been very valuable for testing (it's part of every deploy), but recently failed to catch some memory / scaling issues that later showed up in prod. To catch those, we'll need to replicate

  • data load per instance (currently around 1.3T in prod), and
  • request mix from prod (many writes / updates to existing data).

Matching the storage load per instance is currently not possible due to the limited SSD space on the test nodes.

staging-restbase01.staging.wmflabs is the first instance that has been set up, running Debian Jessie.

Initially Cassandra wouldn't start due to a missing libjemalloc.so.

After installing libjemalloc-dev, Cassandra now fails to start with the error:

java.lang.RuntimeException: Unable to gossip with any seeds
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1307) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:774) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:711) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:602) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:394) [apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:536) [apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.1.6.jar:2.1.6]
WARN  [StorageServiceShutdownHook] 2015-06-30 00:23:03,829 Gossiper.java:1418 - No local state or state is in silent shutdown, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2015-06-30 00:23:03,830 MessagingService.java:708 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.68.17.67] 2015-06-30 00:23:03,831 MessagingService.java:958 - MessagingService has terminated the accept() thread

staging-test-tin is the staging project's equivalent of tin, and staging-palladium is its puppet master; the latter currently carries several patches, including one for RESTBase. Everyone on this ticket is now an admin of the staging project on Wikitech, too.

The problem is Cassandra's configuration. In hieradata/labs/staging/host/staging-restbase01.yaml on staging-palladium you should set:

cassandra::seeds: ['10.68.17.67']

That is, the seed list needs the exact IP of the interface Cassandra is bound to (by default, the Puppet module binds it to the first interface it finds on the host). I have changed that directly on staging-palladium, and Cassandra is now up and running on staging-restbase01.
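A quick way to sanity-check that relationship (the file path and addresses below are a made-up sample mirroring the shape of /etc/cassandra/cassandra.yaml, not the real config):

```shell
# Hypothetical sanity check: for gossip to succeed on startup, the node's
# listen_address must appear in a reachable seed list. Write a sample
# config and verify the two values agree.
cat > /tmp/cassandra-sample.yaml <<'EOF'
listen_address: 10.68.17.67
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.68.17.67"
EOF
listen=$(awk '/^listen_address:/ {print $2}' /tmp/cassandra-sample.yaml)
if grep -q "seeds: \"$listen\"" /tmp/cassandra-sample.yaml; then
    echo "OK: $listen is in the seed list"
else
    echo "WARNING: $listen missing from seed list; gossip may fail"
fi
```

If the listen address is missing from every reachable node's seed list, the node fails with exactly the "Unable to gossip with any seeds" error above.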

mobrovac moved this task from Backlog to In progress on the RESTBase board. Jun 30 2015, 9:30 AM

So, I moved the role definitions into nodes/labs/staging.yaml and the hiera data to Hiera:Staging on wikitech, so now any instance named .*-restbase\d{2} in the staging project should get the appropriate roles. This should make it super easy to spin up n instances, but I'm still having some problems with restbase01 :(
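As a sanity check on that naming pattern (the hostnames below are examples), the regex only picks up instances with the full "restbase" spelling followed by two digits:

```shell
# Check sample instance names against the .*-restbase\d{2} pattern used
# for the staging project. Note the misspelled third name does not match.
for host in staging-restbase01 staging-restbase10 staging-resbase01; do
    if echo "$host" | grep -Eq -- '.*-restbase[0-9]{2}'; then
        echo "$host: matches"
    else
        echo "$host: does NOT match"
    fi
done
```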

For some reason puppet on staging-restbase01 is still failing with the error:

Error: Could not start Service[restbase]: Execution of '/etc/init.d/restbase start' returned 6: Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
failed!
Wrapped exception:
Execution of '/etc/init.d/restbase start' returned 6: Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
failed!
Error: /Stage[main]/Restbase/Service[restbase]/ensure: change from stopped to running failed: Could not start Service[restbase]: Execution of '/etc/init.d/restbase start' returned 6: Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
failed!
Notice: Finished catalog run in 14.19 seconds

Starting restbase directly from the command line, however, works fine:

thcipriani@staging-restbase01:~$ nodejs /usr/lib/restbase/deploy/restbase/server.js -c /etc/restbase/config.yaml
{"name":"restbase","hostname":"staging-restbase01","pid":8441,"level":30,"msg":"master(8441) initializing 1 workers","time":"2015-07-01T13:05:45.280Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Adding host 10.68.17.67","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.560Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Getting first connection","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.563Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Connecting to 10.68.17.67:9042","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.608Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Trying to use protocol version 2","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.621Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Control connection using protocol version 2","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.762Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Connection acquired, refreshing nodes list","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.763Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Refreshing local and peers info","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.763Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Local info retrieved","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.778Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Peers info retrieved","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.791Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Retrieving keyspaces metadata","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.792Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"ControlConnection connected and up to date","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.839Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Connecting to 10.68.17.67:9042","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.844Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"msg":"listening on port 7231","time":"2015-07-01T13:05:47.409Z","v":0}

I'm out for the rest of this week. @demon, can you take a stab at these errors?

demon added a comment. Jul 1 2015, 3:57 PM

Had to sudo systemctl enable restbase.service; it works now. The question is why puppet didn't do this for us.

This is a deliberate safety measure: starting RESTBase on boot is disabled because Cassandra is not started on boot either, since an unattended automatic start might corrupt the data.

thcipriani triaged this task as High priority. Jul 6 2015, 6:38 PM
thcipriani moved this task from To Triage to In-progress on the Deployments board.

Not starting restbase on boot also makes sure that it doesn't come up with out-of-date code when a node comes back after missing the last deploys.

So I created all restbase instances (staging-restbase{01-10}); however, while getting the cassandra cluster built I ran into some issues. First I ran sudo salt 'staging-restbase*' cmd.run 'rm -rf -- /var/lib/cassandra/*', then sudo salt 'staging-restbase*' cmd.run 'systemctl start cassandra'. Initially I came out with:

root@staging-restbase01:/var/lib/cassandra# nodetool -h 10.68.17.67 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UJ  10.68.17.171  72.57 KB   256     ?       18a4bf5d-b74a-4129-9fb7-05a70f5afb45  rack1
DN  10.68.17.137  ?          256     ?       9d1291ea-5097-4df7-bd12-557f0b669e0a  rack1
DN  10.68.17.83   ?          256     ?       c00fb6b0-8518-4455-9f41-2929cbcfe7c5  rack1
UN  10.68.17.67   1.61 MB    256     ?       ba9f411c-76d3-4eaa-9bc9-bfd518de659d  rack1
UN  10.68.17.179  73.71 KB   256     ?       9200774f-1570-4a31-9198-4a85653ba2f8  rack1
UN  10.68.17.176  68.37 KB   256     ?       99e52f1b-369b-4a5f-b91f-3d21b0134d1c  rack1

Then, after rerunning the commands (rm -rf -- /var/lib/cassandra && systemctl start cassandra), I can't get any nodes to come up via systemctl at all. They all seem to work when run directly via /usr/sbin/cassandra.

GWicke added a comment. Edited · Jul 7 2015, 12:30 AM

@thcipriani, you need to temporarily add one node to its own seeds in the cassandra config (/etc/cassandra/cassandra.yaml) in order to let it start up. All other nodes (and itself, when it re-joins later) will then seed from the other nodes in the cluster. This is to

a) make sure new nodes bootstrap properly before joining the cluster and accepting requests, and
  b) avoid a stand-alone node forming its own cluster without operator intervention (think split-brain).

Also, you should start up one cluster node at a time and wait until it is fully bootstrapped before proceeding.
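A sketch of that rollout, with the real commands echoed rather than executed (hostnames are assumed from this project; the nodetool wait condition is a placeholder, since nodetool status lists node IPs rather than hostnames):

```shell
# Dry run: print the per-node steps for a one-at-a-time cluster start.
# Each node must be fully bootstrapped (shown as UN in nodetool status)
# before the next one joins.
for host in staging-restbase{01..10}; do
    echo "ssh $host sudo systemctl start cassandra"
    echo "wait until 'nodetool status' shows $host as UN before continuing"
done
```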

@GWicke thanks for the tips—finally got everything up and running.

staging-restbase{01..10} are now up and running on Debian Jessie. All are in the same cassandra cluster with clean puppet runs.

Seems the restbase service doesn't actually need to be enabled to run; anything that triggers systemctl daemon-reload does the trick. systemd just hadn't seen the /etc/init.d/restbase file, and systemctl enable restbase.service triggers a daemon-reload somewhere along the way.
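In other words, after a new init script lands, something like the following (run as root; shown here as an echoed dry run) should be sufficient on its own, without enabling the unit:

```shell
# Dry run: make systemd's sysv generator pick up /etc/init.d/restbase,
# then start the generated unit. 'systemctl enable' only worked earlier
# because it happens to trigger the same daemon-reload.
echo "systemctl daemon-reload"
echo "systemctl start restbase"
```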

Also, there was some strange systemd/sysvinit compatibility behavior: the cassandra service wouldn't start once it had been stopped, but calling restart worked fine:

root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (exited) since Tue 2015-07-07 00:16:42 UTC; 19h ago

root@staging-restbase05:~# systemctl start cassandra
root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (exited) since Tue 2015-07-07 00:16:42 UTC; 19h ago

root@staging-restbase05:~# systemctl restart cassandra
root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (running) since Tue 2015-07-07 19:18:30 UTC; 2s ago
  Process: 31409 ExecStop=/etc/init.d/cassandra stop (code=exited, status=0/SUCCESS)
  Process: 31444 ExecStart=/etc/init.d/cassandra start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cassandra.service
           └─31543 java -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1005M -Xmx100...

Nodetool indicates that everything looks good; on to getting a deploy working on staging.

root@staging-restbase01:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UN  10.68.17.137  670.46 KB  256     ?       0500afd5-0e0c-434c-9482-2c92ad1949b0  rack1
UN  10.68.17.171  721.21 KB  256     ?       70006337-670f-420d-90a3-3778dbc0021c  rack1
UN  10.68.17.173  1.76 MB    256     ?       58c58776-ce55-4d6b-b01b-ed5f14e92d03  rack1
UN  10.68.17.83   1.16 MB    256     ?       ef28c9c1-ae2c-4b28-8f88-6fb1d3b180d7  rack1
UN  10.68.17.172  1.63 MB    256     ?       687b6271-3d0a-4a4f-8684-6815be0f550f  rack1
UN  10.68.17.67   672.14 KB  256     ?       57d8fa90-d41b-489a-8245-ae2b569ceb42  rack1
UN  10.68.17.179  1.21 MB    256     ?       2b3d5faa-15b2-413a-b6cd-c57b16f17b44  rack1
UN  10.68.17.176  1.69 MB    256     ?       5598cfc6-0790-4bb7-ad6c-36e3959d8ad5  rack1
UN  10.68.17.167  667.13 KB  256     ?       05bf7301-9f0d-4719-bfc4-a0ad3de9ad32  rack1
UN  10.68.17.183  1.29 MB    256     ?       dea82113-3ace-483d-be46-11937dc737f8  rack1

GWicke added a comment. Edited · Jul 7 2015, 7:55 PM

@thcipriani, to test our current ansible system, you need to

  1. add a new inventory file similar to https://github.com/wikimedia/ansible-deploy/blob/master/staging ('labs-staging'?)
  2. use that with -i instead of our staging cluster
  3. run ansible-playbook -i labs-staging roles/restbase/setup.yml to switch from trebuchet to ansible
  4. run ansible-playbook -i labs-staging roles/restbase/deploy.yml to deploy
thcipriani closed this task as Resolved. Jul 8 2015, 12:22 AM
thcipriani claimed this task.

Added the file labs-staging to the root of the ansible deploy repo with the contents:

[eqiad:children]
eqiad-restbase

[restbase:children]
eqiad-restbase

[eqiad-restbase]
staging-restbase[01:10].eqiad.wmflabs

After that the command ansible-playbook -i labs-staging roles/restbase/deploy.yml -vvv seemed to work ok.
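The inventory range staging-restbase[01:10].eqiad.wmflabs expands to ten hosts; the equivalent shell brace expansion shows the list Ansible will target:

```shell
# Expand the same host range the Ansible inventory pattern covers,
# one hostname per line.
printf '%s\n' staging-restbase{01..10}.eqiad.wmflabs
```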

Ran into the error Fatal: shared connection to [host] closed a couple of times; I'll investigate more, but for now I'm marking this task as resolved.

Full session log: https://dpaste.de/jhS6