
Setup staging for testing RESTBase deploys
Closed, ResolvedPublic

Description

  • Size of staging setup
    • More than 1
    • Big enough to test for next steps
    • 10–12?
  • Cassandra instance per restbase install
  • config example in RESTBase repo w/public wikipedia and parsoid
  • Keys in ops/private also in config example
    • Cassandra user/pass
    • salt
  • metrics/graphite jar file via trebuchet...maybe in labs, definitely in production

Event Timeline

thcipriani raised the priority of this task from to Needs Triage.
thcipriani updated the task description.
thcipriani added a project: Deployments.
Restricted Application added a subscriber: Aklapper. Jun 30 2015, 12:48 AM
Jdforrester-WMF set Security to None.
GWicke added a subscriber: GWicke. Edited · Jun 30 2015, 12:57 AM

Our current staging setup uses three physical nodes (xenon, cerium, praseodymium) in prod. It has been very valuable for testing (it's part of every deploy), but recently failed to catch some memory / scaling issues that later showed up in prod. To catch those, we'll need to replicate

  • data load per instance (currently around 1.3T in prod), and
  • request mix from prod (many writes / updates to existing data).

Matching the storage load per instance is currently not possible due to the limited SSD space on the test nodes.

staging-restbase01.staging.wmflabs is the first instance that has been set up, running Debian Jessie.

Initially Cassandra wouldn't start due to a missing libjemalloc.so.

After installing libjemalloc-dev, Cassandra now fails to start with the error:

java.lang.RuntimeException: Unable to gossip with any seeds
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1307) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:530) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:774) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:711) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:602) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:394) [apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:536) [apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:625) [apache-cassandra-2.1.6.jar:2.1.6]
WARN  [StorageServiceShutdownHook] 2015-06-30 00:23:03,829 Gossiper.java:1418 - No local state or state is in silent shutdown, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2015-06-30 00:23:03,830 MessagingService.java:708 - Waiting for messaging service to quiesce
INFO  [ACCEPT-/10.68.17.67] 2015-06-30 00:23:03,831 MessagingService.java:958 - MessagingService has terminated the accept() thread

staging-test-tin is the staging project's equivalent of tin, and staging-palladium is its puppet master; the latter currently carries several patches, including one for RESTBase. Everyone on this ticket is now an admin of the staging project on Wikitech, too.

The problem is Cassandra's configuration. In hieradata/labs/staging/host/staging-restbase01.yaml on staging-palladium you should set:

cassandra::seeds: ['10.68.17.67']

That is, the seed list needs the exact IP of the interface Cassandra is bound to (by default, the Puppet module binds it to the first interface it finds on the host). I have changed that directly on staging-palladium, and Cassandra is now up and running on staging-restbase01.
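A quick way to sanity-check that relationship (the file path and addresses below are a made-up sample mirroring the shape of /etc/cassandra/cassandra.yaml, not the real config):

```shell
# Hypothetical sanity check: for gossip to succeed on startup, the node's
# listen_address must appear in a reachable seed list. Write a sample
# config and verify the two values agree.
cat > /tmp/cassandra-sample.yaml <<'EOF'
listen_address: 10.68.17.67
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.68.17.67"
EOF
listen=$(awk '/^listen_address:/ {print $2}' /tmp/cassandra-sample.yaml)
if grep -q "seeds: \"$listen\"" /tmp/cassandra-sample.yaml; then
    echo "OK: $listen is in the seed list"
else
    echo "WARNING: $listen missing from seed list; gossip may fail"
fi
```

If the listen address is missing from every reachable node's seed list, the node fails with exactly the "Unable to gossip with any seeds" error above.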

mobrovac moved this task from Backlog to In progress on the RESTBase board. Jun 30 2015, 9:30 AM

So, I moved the role definitions into nodes/labs/staging.yaml and the hiera data to Hiera:Staging on wikitech, so now any instance named .*-restbase\d{2} in the staging project should get the appropriate roles. This should make it super easy to spin up n instances, but I'm still having some problems with restbase01 :(
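As a sanity check on that naming pattern (the hostnames below are examples), the regex only picks up instances with the full "restbase" spelling followed by two digits:

```shell
# Check sample instance names against the .*-restbase\d{2} pattern used
# for the staging project. Note the misspelled third name does not match.
for host in staging-restbase01 staging-restbase10 staging-resbase01; do
    if echo "$host" | grep -Eq -- '.*-restbase[0-9]{2}'; then
        echo "$host: matches"
    else
        echo "$host: does NOT match"
    fi
done
```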

For some reason puppet on staging-restbase01 is still failing with the error:

Error: Could not start Service[restbase]: Execution of '/etc/init.d/restbase start' returned 6: Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
failed!
Wrapped exception:
Execution of '/etc/init.d/restbase start' returned 6: Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
failed!
Error: /Stage[main]/Restbase/Service[restbase]/ensure: change from stopped to running failed: Could not start Service[restbase]: Execution of '/etc/init.d/restbase start' returned 6: Starting restbase (via systemctl): restbase.service
Failed to start restbase.service: Unit restbase.service failed to load: No such file or directory.
failed!
Notice: Finished catalog run in 14.19 seconds

Starting restbase directly from the command line, however, works fine:

thcipriani@staging-restbase01:~$ nodejs /usr/lib/restbase/deploy/restbase/server.js -c /etc/restbase/config.yaml
{"name":"restbase","hostname":"staging-restbase01","pid":8441,"level":30,"msg":"master(8441) initializing 1 workers","time":"2015-07-01T13:05:45.280Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Adding host 10.68.17.67","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.560Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Getting first connection","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.563Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Connecting to 10.68.17.67:9042","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.608Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Trying to use protocol version 2","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.621Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Control connection using protocol version 2","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.762Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Connection acquired, refreshing nodes list","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.763Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Refreshing local and peers info","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.763Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Local info retrieved","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.778Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Peers info retrieved","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.791Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"Retrieving keyspaces metadata","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.792Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"ControlConnection","info":"ControlConnection connected and up to date","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.839Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"message":"Connection","info":"Connecting to 10.68.17.67:9042","levelPath":"info/table/cassandra/driver","msg":"","time":"2015-07-01T13:05:46.844Z","v":0}
{"name":"restbase","hostname":"staging-restbase01","pid":8447,"level":30,"msg":"listening on port 7231","time":"2015-07-01T13:05:47.409Z","v":0}

I'm out for the rest of this week. @demon, can you take a stab at these errors?

demon added a comment. Jul 1 2015, 3:57 PM

Had to sudo systemctl enable restbase.service; it works now. The question is why puppet didn't do this for us.

This is a deliberate safety measure: starting RESTBase on boot is disabled because Cassandra is not started on boot either, since an unattended automatic start might corrupt the data.

thcipriani triaged this task as High priority. Jul 6 2015, 6:38 PM
thcipriani moved this task from To Triage to In-progress on the Deployments board.

Not starting restbase on boot also makes sure that it doesn't come up with out-of-date code when a node comes back after missing the last deploys.

So I created all restbase instances (staging-restbase{01-10}); however, while getting the cassandra cluster built I ran into some issues. First I ran sudo salt 'staging-restbase*' cmd.run 'rm -rf -- /var/lib/cassandra/*', then sudo salt 'staging-restbase*' cmd.run 'systemctl start cassandra'. Initially I came out with:

root@staging-restbase01:/var/lib/cassandra# nodetool -h 10.68.17.67 status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UJ  10.68.17.171  72.57 KB   256     ?       18a4bf5d-b74a-4129-9fb7-05a70f5afb45  rack1
DN  10.68.17.137  ?          256     ?       9d1291ea-5097-4df7-bd12-557f0b669e0a  rack1
DN  10.68.17.83   ?          256     ?       c00fb6b0-8518-4455-9f41-2929cbcfe7c5  rack1
UN  10.68.17.67   1.61 MB    256     ?       ba9f411c-76d3-4eaa-9bc9-bfd518de659d  rack1
UN  10.68.17.179  73.71 KB   256     ?       9200774f-1570-4a31-9198-4a85653ba2f8  rack1
UN  10.68.17.176  68.37 KB   256     ?       99e52f1b-369b-4a5f-b91f-3d21b0134d1c  rack1

Then, after rerunning the commands (rm -rf -- /var/lib/cassandra && systemctl start cassandra), I can't get any nodes to come up via systemctl at all. They all seem to work when run directly via /usr/sbin/cassandra.

GWicke added a comment. Edited · Jul 7 2015, 12:30 AM

@thcipriani, you need to temporarily add one node to its own seeds in the cassandra config (/etc/cassandra/cassandra.yaml) in order to let it start up. All other nodes (and itself, when it re-joins later) will then seed from the other nodes in the cluster. This is to

a) make sure new nodes bootstrap properly before joining the cluster and accepting requests, and
  b) avoid a stand-alone node forming its own cluster without operator intervention (think split-brain).

Also, you should start up one cluster node at a time and wait until it is fully bootstrapped before proceeding.
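A sketch of that rollout, with the real commands echoed rather than executed (hostnames are assumed from this project; the nodetool wait condition is a placeholder, since nodetool status lists node IPs rather than hostnames):

```shell
# Dry run: print the per-node steps for a one-at-a-time cluster start.
# Each node must be fully bootstrapped (shown as UN in nodetool status)
# before the next one joins.
for host in staging-restbase{01..10}; do
    echo "ssh $host sudo systemctl start cassandra"
    echo "wait until 'nodetool status' shows $host as UN before continuing"
done
```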

@GWicke thanks for the tips—finally got everything up and running.

staging-restbase{01..10} are now up and running on Debian Jessie. All are in the same cassandra cluster with clean puppet runs.

Seems the restbase service doesn't actually need to be enabled to run; anything that triggers systemctl daemon-reload does the trick. systemd just hadn't seen the /etc/init.d/restbase file, and systemctl enable restbase.service triggers a daemon-reload somewhere along the way.
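In other words, after a new init script lands, something like the following (run as root; shown here as an echoed dry run) should be sufficient on its own, without enabling the unit:

```shell
# Dry run: make systemd's sysv generator pick up /etc/init.d/restbase,
# then start the generated unit. 'systemctl enable' only worked earlier
# because it happens to trigger the same daemon-reload.
echo "systemctl daemon-reload"
echo "systemctl start restbase"
```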

Also, there was some strange systemd/sysvinit compatibility behavior: the cassandra service wouldn't start once it had been stopped, but calling restart worked fine:

root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (exited) since Tue 2015-07-07 00:16:42 UTC; 19h ago

root@staging-restbase05:~# systemctl start cassandra
root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (exited) since Tue 2015-07-07 00:16:42 UTC; 19h ago

root@staging-restbase05:~# systemctl restart cassandra
root@staging-restbase05:~# systemctl status cassandra
● cassandra.service - LSB: distributed storage system for structured data
   Loaded: loaded (/etc/init.d/cassandra)
   Active: active (running) since Tue 2015-07-07 19:18:30 UTC; 2s ago
  Process: 31409 ExecStop=/etc/init.d/cassandra stop (code=exited, status=0/SUCCESS)
  Process: 31444 ExecStart=/etc/init.d/cassandra start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cassandra.service
           └─31543 java -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1005M -Xmx100...

Nodetool indicates that everything looks good; on to getting a deploy working on staging.

root@staging-restbase01:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address       Load       Tokens  Owns    Host ID                               Rack
UN  10.68.17.137  670.46 KB  256     ?       0500afd5-0e0c-434c-9482-2c92ad1949b0  rack1
UN  10.68.17.171  721.21 KB  256     ?       70006337-670f-420d-90a3-3778dbc0021c  rack1
UN  10.68.17.173  1.76 MB    256     ?       58c58776-ce55-4d6b-b01b-ed5f14e92d03  rack1
UN  10.68.17.83   1.16 MB    256     ?       ef28c9c1-ae2c-4b28-8f88-6fb1d3b180d7  rack1
UN  10.68.17.172  1.63 MB    256     ?       687b6271-3d0a-4a4f-8684-6815be0f550f  rack1
UN  10.68.17.67   672.14 KB  256     ?       57d8fa90-d41b-489a-8245-ae2b569ceb42  rack1
UN  10.68.17.179  1.21 MB    256     ?       2b3d5faa-15b2-413a-b6cd-c57b16f17b44  rack1
UN  10.68.17.176  1.69 MB    256     ?       5598cfc6-0790-4bb7-ad6c-36e3959d8ad5  rack1
UN  10.68.17.167  667.13 KB  256     ?       05bf7301-9f0d-4719-bfc4-a0ad3de9ad32  rack1
UN  10.68.17.183  1.29 MB    256     ?       dea82113-3ace-483d-be46-11937dc737f8  rack1

GWicke added a comment. Edited · Jul 7 2015, 7:55 PM

@thcipriani, to test our current ansible system, you need to

  1. add a new inventory file similar to https://github.com/wikimedia/ansible-deploy/blob/master/staging ('labs-staging'?)
  2. use that with -i instead of our staging cluster
  3. run ansible-playbook -i labs-staging roles/restbase/setup.yml to switch from trebuchet to ansible
  4. run ansible-playbook -i labs-staging roles/restbase/deploy.yml to deploy
thcipriani closed this task as Resolved. Jul 8 2015, 12:22 AM
thcipriani claimed this task.

Added the file labs-staging to the root of the ansible deploy repo with the contents:

[eqiad:children]
eqiad-restbase

[restbase:children]
eqiad-restbase

[eqiad-restbase]
staging-restbase[01:10].eqiad.wmflabs

After that the command ansible-playbook -i labs-staging roles/restbase/deploy.yml -vvv seemed to work ok.
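The inventory range staging-restbase[01:10].eqiad.wmflabs expands to ten hosts; the equivalent shell brace expansion shows the list Ansible will target:

```shell
# Expand the same host range the Ansible inventory pattern covers,
# one hostname per line.
printf '%s\n' staging-restbase{01..10}.eqiad.wmflabs
```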

Ran into the error Fatal: shared connection to [host] closed a couple of times; I'll investigate more, but for now I'm marking this task as resolved.

Full session log: https://dpaste.de/jhS6