- Size of staging setup
- More than 1
- Big enough to test for next steps
- 10–12?
- Cassandra instance per RESTBase install
- config example in RESTBase repo w/public wikipedia and parsoid
- Keys in ops/private also in config example
- Cassandra user/pass
- salt
- metrics/graphite jar file via trebuchet...maybe in labs, definitely in production
Description
Status | Assigned | Task
---|---|---
Resolved | Legoktm | T67289 Use semantic versioning scheme for WMF (all) releases
Resolved | GWicke | T102550 Use semantic versioning for services (for consistency with mediawiki core)
Resolved | mmodell | T94620 [EPIC] The future of MediaWiki deployment: Tooling
Resolved | GWicke | T93428 Streamline our service development and deployment process
Declined | GWicke | T93433 Evaluate Ansible as a deployment tool
Resolved | thcipriani | T104276 Setup staging for testing RESTBase deploys
Declined | None | T93439 Evaluate Docker as a container deployment tool
Resolved | mobrovac | T95533 Unify SCA Service Puppet Modules / Roles
Resolved | akosiaris | T97031 Define and then implement a way for a future service owner to provide the info required to have a new service brought into production
Resolved | akosiaris | T97036 Define and implement an automated process to ease the introduction of a new service into production
Event Timeline
Our current staging setup uses three physical nodes (xenon, cerium, praseodymium) in prod. It has been very valuable for testing (it's part of every deploy), but recently failed to catch some memory / scaling issues that we then saw in prod. To catch those, we'll need to replicate:
- data load per instance (currently around 1.3T in prod), and
- request mix from prod (many writes / updates to existing data).
Matching the storage load per instance is currently not possible due to the limited SSD space on the test nodes.
staging-restbase01.staging.wmflabs is the first instance that's set up, running Debian Jessie.
Initially Cassandra wouldn't start due to a missing libjemalloc.so.
After installing libjemalloc-dev, Cassandra still won't start, now with the error:
```
java.lang.RuntimeException: Unable to gossip with any seeds
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1307) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:774) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:711) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:602) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:394) [apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:536) [apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.1.6.jar:2.1.6]
WARN  [StorageServiceShutdownHook] 2015-06-30 00:23:03,829 Gossiper.java:1418 - No local state or state is in silent shutdown, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2015-06-30 00:23:03,830 MessagingService.java:708 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.68.17.67] 2015-06-30 00:23:03,831 MessagingService.java:958 - MessagingService has terminated the accept() thread
```
staging-test-tin is the tin of the staging project, staging-palladium is the puppet master, it's got several patches including one for RESTBase currently. Everyone on this ticket is now an admin of the staging project on Wikitech, too.
The problem is its configuration. In hieradata/labs/staging/host/staging-restbase01.yaml on staging-palladium you should set:
cassandra::seeds: ['10.68.17.67']
That is, Cassandra needs the exact iface it is bound to (by default, the Puppet module binds it to the first iface it finds on the host). I have changed that directly on staging-palladium and Cassandra is now up and running on staging-restbase01.
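For reference, the whole host-level hiera file is tiny; a sketch (the path and key come from the comment above, the comment text is mine):

```yaml
# hieradata/labs/staging/host/staging-restbase01.yaml (on staging-palladium)
# The seed list must name the exact address Cassandra binds to on this
# host; the Puppet module binds it to the first interface it finds.
cassandra::seeds: ['10.68.17.67']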
So, I moved the roles info into nodes/labs/staging.yaml and I've moved the hiera info to Hiera:Staging on wikitech so now any instance named .*-restbase\d{2} in the staging project should get the appropriate roles. This should make it super easy to spin up n instances, but I'm still having some problems with restbase01 :(
For some reason puppet on staging-restbase01 is still failing with the error:
```
Error: Could not start Service[restbase]: Execution of '/etc/init.d/restbase start' returned 6:
Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
 failed!
Wrapped exception:
Execution of '/etc/init.d/restbase start' returned 6:
Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
 failed!
Error: /Stage[main]/Restbase/Service[restbase]/ensure: change from stopped to running failed: Could not start Service[restbase]: Execution of '/etc/init.d/restbase start' returned 6:
Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
 failed!
Notice: Finished catalog run in 14.19 seconds
```
starting restbase from the command line directly, however, seems fine:
```
thcipriani@staging-restbase01:~$ nodejs /usr/lib/restbase/deploy/restbase/server.js -c /etc/restbase/config.yaml
{"name":"restbase","hostname":"staging-restbase01","pid":8441,"level":30,"msg":"master(8441) initializing 1 workers","time":"2015-07-01T13:05:45.280Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Adding host 10.68.17.67","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.560Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Getting first connection","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.563Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Connecting to 10.68.17.67:9042","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.608Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Trying to use protocol version 2","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.621Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Control connection using protocol version 2","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.762Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Connection acquired, refreshing nodes list","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.763Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Refreshing local and peers info","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.763Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Local info retrieved","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.778Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Peers info retrieved","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.791Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Retrieving keyspaces metadata","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.792Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"ControlConnection connected and up to date","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.839Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Connecting to 10.68.17.67:9042","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.844Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"msg":"listening on port 7231","time":"2015-07-01T13:05:47.409Z","v":0}
```
I'm out for the rest of this week. @demon, can you take a stab at these errors?
Had to `sudo systemctl enable restbase.service`; it works now. The question is why puppet didn't do this for us.
This is a safety measure put in place to disable starting RESTBase on boot, because Cassandra is not started on boot either; an automatic start might corrupt the data.
It also makes sure that restbase is not started up with out-of-date code when a node comes back after missing the last deploys.
So I created all restbase instances (staging-restbase{01..10}); however, in trying to get the Cassandra cluster built I'm running into some issues. First I ran `sudo salt 'staging-restbase*' cmd.run 'rm -rf -- /var/lib/cassandra/*'`, then `sudo salt 'staging-restbase*' cmd.run 'systemctl start cassandra'`. Initially I came out with:
```
root@staging-restbase01:/var/lib/cassandra# nodetool -h 10.68.17.67 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load      Tokens  Owns  Host ID                               Rack
UJ  10.68.17.171  72.57 KB  256     ?     18a4bf5d-b74a-4129-9fb7-05a70f5afb45  rack1
DN  10.68.17.137  ?         256     ?     9d1291ea-5097-4df7-bd12-557f0b669e0a  rack1
DN  10.68.17.83   ?         256     ?     c00fb6b0-8518-4455-9f41-2929cbcfe7c5  rack1
UN  10.68.17.67   1.61 MB   256     ?     ba9f411c-76d3-4eaa-9bc9-bfd518de659d  rack1
UN  10.68.17.179  73.71 KB  256     ?     9200774f-1570-4a31-9198-4a85653ba2f8  rack1
UN  10.68.17.176  68.37 KB  256     ?     99e52f1b-369b-4a5f-b91f-3d21b0134d1c  rack1
```
Then, after rerunning the commands (`rm -rf -- /var/lib/cassandra` and `systemctl start cassandra`), I can't get any nodes to come up via systemctl. They all seem to work when running `/usr/sbin/cassandra` directly.
@thcipriani, you need to temporarily add one node to its own seeds in the cassandra config (/etc/cassandra/cassandra.yaml) in order to let it start up. All other nodes (and itself, when it re-joins later) will then seed from the other nodes in the cluster. This is to
a) make sure new nodes bootstrap properly before joining the cluster and accepting requests, and
- b) avoid having a stand-alone node form its own cluster without operator intervention (think split-brain).
Also, you should start up one cluster node at a time and wait until it is fully bootstrapped before proceeding.
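The wait-until-bootstrapped step above can be checked from `nodetool status` output: a node's row shows `UN` (Up/Normal) once it has fully joined, while `UJ` means still joining and `DN` means down. A minimal readiness check, run here against an abridged capture of the status output from this ticket rather than a live cluster (file paths and the helper name are illustrative):

```shell
# Return success only when every node row in a captured
# `nodetool status` dump is in Normal state (UN); DN/UJ/UL/UM
# rows mean the cluster has not settled yet.
all_nodes_normal() {
  ! grep -Eq '^(DN|UJ|UL|UM) ' "$1"
}

# Abridged capture from the ticket: one joining node, one down node.
cat > /tmp/status.txt <<'EOF'
UJ 10.68.17.171 72.57 KB 256 ? 18a4bf5d-b74a-4129-9fb7-05a70f5afb45 rack1
DN 10.68.17.137 ?        256 ? 9d1291ea-5097-4df7-bd12-557f0b669e0a rack1
UN 10.68.17.67  1.61 MB  256 ? ba9f411c-76d3-4eaa-9bc9-bfd518de659d rack1
EOF

if all_nodes_normal /tmp/status.txt; then
  echo "cluster settled; safe to start the next node"
else
  echo "still bootstrapping; wait before starting the next node"
fi
```

With the captured dump above this prints the "still bootstrapping" branch, since one node is joining and one is down.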
@GWicke thanks for the tips—finally got everything up and running.
staging-restbase{01..10} are now up and running on Debian Jessie. All are in the same Cassandra cluster, with clean puppet runs.
Seems the restbase service doesn't necessarily need to be enabled to run: anything that triggers `systemctl daemon-reload` does the trick. systemd just hadn't seen the /etc/init.d/restbase file yet, and `systemctl enable restbase.service` evidently triggers a daemon-reload somewhere along the way.
Also, this was strange systemd/sysvinit compatibility behavior: the cassandra service wouldn't start once it had been stopped, but calling restart worked fine:
```
root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (exited) since Tue 2015-07-07 00:16:42 UTC; 19h ago
root@staging-restbase05:~# systemctl start cassandra
root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (exited) since Tue 2015-07-07 00:16:42 UTC; 19h ago
root@staging-restbase05:~# systemctl restart cassandra
root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (running) since Tue 2015-07-07 19:18:30 UTC; 2s ago
  Process: 31409 ExecStop=/etc/init.d/cassandra stop (code=exited, status=0/SUCCESS)
  Process: 31444 ExecStart=/etc/init.d/cassandra start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cassandra.service
           └─31543 java -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1005M -Xmx100...
```
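A plausible reading of the transcript above: systemd considers an LSB unit in state `active (exited)` already started, so `start` is a no-op even when the daemon is gone, while `restart` forces a stop/start cycle. Distinguishing the two `Active:` states from a captured `systemctl status` line (the file path is hypothetical):

```shell
# One captured status line from the "won't start" state above.
cat > /tmp/cassandra_status.txt <<'EOF'
Active: active (exited) since Tue 2015-07-07 00:16:42 UTC; 19h ago
EOF

# "active (running)" means a tracked daemon process exists;
# "active (exited)" means the unit is nominally up with no process.
if grep -q 'active (running)' /tmp/cassandra_status.txt; then
  echo "daemon is running"
else
  echo "unit active but no daemon; systemctl restart is needed"
fi
```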
Nodetool seems to indicate that everything looks good, on to getting a deploy working on staging.
```
root@staging-restbase01:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns  Host ID                               Rack
UN  10.68.17.137  670.46 KB  256     ?     0500afd5-0e0c-434c-9482-2c92ad1949b0  rack1
UN  10.68.17.171  721.21 KB  256     ?     70006337-670f-420d-90a3-3778dbc0021c  rack1
UN  10.68.17.173  1.76 MB    256     ?     58c58776-ce55-4d6b-b01b-ed5f14e92d03  rack1
UN  10.68.17.83   1.16 MB    256     ?     ef28c9c1-ae2c-4b28-8f88-6fb1d3b180d7  rack1
UN  10.68.17.172  1.63 MB    256     ?     687b6271-3d0a-4a4f-8684-6815be0f550f  rack1
UN  10.68.17.67   672.14 KB  256     ?     57d8fa90-d41b-489a-8245-ae2b569ceb42  rack1
UN  10.68.17.179  1.21 MB    256     ?     2b3d5faa-15b2-413a-b6cd-c57b16f17b44  rack1
UN  10.68.17.176  1.69 MB    256     ?     5598cfc6-0790-4bb7-ad6c-36e3959d8ad5  rack1
UN  10.68.17.167  667.13 KB  256     ?     05bf7301-9f0d-4719-bfc4-a0ad3de9ad32  rack1
UN  10.68.17.183  1.29 MB    256     ?     dea82113-3ace-483d-be46-11937dc737f8  rack1
```
@thcipriani, to test our current ansible system, you need to
- add a new inventory file similar to https://github.com/wikimedia/ansible-deploy/blob/master/staging ('labs-staging'?)
- use that with `-i` instead of our staging cluster
- run `ansible-playbook -i labs-staging roles/restbase/setup.yml` to switch from trebuchet to ansible
- run `ansible-playbook -i labs-staging roles/restbase/deploy.yml` to deploy
Added the file labs-staging to the root of the ansible deploy repo with the contents:
```
[eqiad:children]
eqiad-restbase

[restbase:children]
eqiad-restbase

[eqiad-restbase]
staging-restbase[01:10].eqiad.wmflabs
```
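As a sanity check, the `[01:10]` range pattern in the inventory covers the same ten hosts as bash brace expansion (this assumes bash, which zero-pads `{01..10}`):

```shell
# List the ten hostnames that the inventory pattern
# staging-restbase[01:10].eqiad.wmflabs should match.
printf '%s\n' staging-restbase{01..10}.eqiad.wmflabs
```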
After that, the command `ansible-playbook -i labs-staging roles/restbase/deploy.yml -vvv` seemed to work ok.
Ran into the error `Fatal: shared connection to [host] closed` a couple of times; I'll investigate more, but for now I'm marking this task as resolved.
Full session log: https://dpaste.de/jhS6