
RESTBase deployment process
Closed, Resolved · Public

Description

This task lists the concrete steps and functionalities we need / expect from the (future) RESTBase deployment system.

Deployment Process

All of the actions are to be carried out from a deployment host (currently tin). Unlike the current Trebuchet-powered deploy, the deployment host does not need to act as a proxy serving the target hosts; they may fetch their code directly from git/gerrit.

Regular Code Deploy

Most of the time, the code to be deployed consists of logic improvements and feature additions, and as such has no impact on the underlying storage (Cassandra). Here are the steps needed to complete a successful deploy, to be executed in sequence on each host (a sketch of the loop follows the list):

  1. depool host
  2. stop RESTBase
  3. send / fetch the code
  4. (re)start RESTBase
  5. wait for it to bind to its port
  6. checks / tests
    1. curl some known endpoints
    2. check the logs and graphite for anomalies after restart
      1. error- and fatal-level log entries
      2. Cassandra connection issues
      3. 5xx request response rates for the given host
  7. repeat all steps for the next host
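
Schematically, the per-host loop could look like the minimal Python sketch below. Everything in it is an assumption for illustration — the host names, the depool/repool commands (a confctl-style pooling mechanism), the paths and the health-check URL — not the actual tooling:

```python
# Minimal sketch of the per-host rolling deploy loop described above.
# Host list, depool/repool commands, paths and the health-check URL are
# all assumptions for illustration, not the real tooling.
import subprocess
import time
import urllib.request

HOSTS = ["restbase1001.eqiad.wmnet", "restbase1002.eqiad.wmnet"]  # example
CHECK_URL = "http://{host}:7231/en.wikipedia.org/v1/?spec"        # hypothetical

def run(host, cmd):
    """Run a command on a remote host over ssh (sketch only)."""
    subprocess.run(["ssh", host, cmd], check=True)

def wait_for_port(host, timeout=60):
    """Poll the service URL until RESTBase has bound to its port."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(CHECK_URL.format(host=host), timeout=5) as r:
                if r.status == 200:          # 6.1: curl a known endpoint
                    return
        except OSError:
            pass
        time.sleep(2)
    raise RuntimeError(f"{host}: RESTBase did not come back up in time")

def deploy_host(host, git_hash):
    run(host, "depool")                                        # 1. depool
    run(host, "sudo service restbase stop")                    # 2. stop
    run(host, f"cd /srv/deployment/restbase && git fetch --all "
              f"&& git checkout {git_hash}")                   # 3. fetch code
    run(host, "sudo service restbase start")                   # 4. (re)start
    wait_for_port(host)                                        # 5. wait for bind
    # 6.2: log / graphite anomaly checks would be hooked in here
    run(host, "repool")

def rolling_deploy(git_hash):
    for host in HOSTS:                                         # 7. next host
        deploy_host(host, git_hash)
```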

Abort Mechanism

For regular code deploys, aborting is pretty straightforward. The deployment system should keep track of the hash/tag currently deployed (without interfering with the repository itself in any way). Should a deploy fail, it simply re-enforces the previously-known-to-work tag/branch/hash on all of the hosts sequentially.
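
Continuing the sketch above, the abort path could be as simple as recording the last-known-good hash out of band (the state-file location here is invented) and re-running the same rolling procedure with it on failure:

```python
# Sketch of the abort mechanism: track the deployed hash outside the
# repository and re-enforce the previous one if the deploy fails.
from pathlib import Path

STATE_FILE = Path("/srv/deployment/restbase/.last-good")  # hypothetical

def deploy(git_hash):
    last_good = STATE_FILE.read_text().strip() if STATE_FILE.exists() else None
    try:
        rolling_deploy(git_hash)          # per-host loop from the sketch above
        STATE_FILE.write_text(git_hash)   # promote to last-known-good
    except Exception:
        if last_good:
            rolling_deploy(last_good)     # sequentially restore the old code
        raise
```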

Some new features and/or improvements to existing functionality require a new storage schema to be applied. Schemas are versioned and are applied by RESTBase on start-up. As this is a rather sensitive Cassandra operation, extra measures and precautions need to be taken. Because schemas are versioned, the back-end storage will refuse to apply a schema with a version lower than the one currently present. This means that in this case the abort mechanism involves an additional step: a manual commit from the deployer that bumps the schema version number of the last stable schema.
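
To make the version constraint concrete, here is a simplified model (not RESTBase's actual implementation) of why rolling back past a schema change requires that manual bump:

```python
# Simplified model of versioned schema application: the storage layer only
# accepts strictly increasing versions, so aborting a schema deploy means
# re-committing the last stable schema under a *higher* version number.
def apply_schema(current_version, new_schema):
    if new_schema["version"] <= current_version:
        raise ValueError("refusing to apply a schema with a non-increasing version")
    # ... apply the Cassandra keyspace/table changes here ...
    return new_schema["version"]

current = 8                      # a bad deploy just applied version 8
# re-applying the last stable schema (version 7) as-is would fail:
#   apply_schema(current, {"version": 7})  -> ValueError
# so the deployer commits the stable schema renumbered past the bad one:
current = apply_schema(current, {"version": 9})  # contents of v7, bumped to 9
```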

Configuration Management

RESTBase's configuration is made up of several parts: ops, code-specific and schema. Ops refers to the part of the configuration controlled by TechOps; this mainly covers host-specific configuration directives, such as the list of Cassandra nodes and the host names of Parsoid and the other *oids. Code-specific configuration includes simple, RESTBase-specific details, such as the pagination size or the list of active modules. Finally, schema configuration directives are those whose changes are reflected in the underlying storage, most notably the addition of new back-ends that need storage, and of new domains.

Currently, the configuration file is managed as a whole in ops/puppet. While this works well for the ops and code-specific configuration bits, it is highly inadequate for the schema-related ones.

Instead, whenever such a configuration directive changes (or a new one is added), the deployment process described earlier must be used. Hence, we require configuration changes to be treated as code deploys, with rolling deploys and rollback possibilities.
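
A deployment system could, for instance, classify changed configuration keys to decide which path a change must take; the key names below are hypothetical:

```python
# Sketch: force schema-affecting config changes through the rolling
# deploy path, while ops/code-specific ones can be applied in place.
# The key prefixes are hypothetical.
SCHEMA_PREFIXES = ("storage_backends", "domains")

def requires_rolling_deploy(old_cfg, new_cfg):
    changed = {k for k in set(old_cfg) | set(new_cfg)
               if old_cfg.get(k) != new_cfg.get(k)}
    return any(k.startswith(SCHEMA_PREFIXES) for k in changed)

# e.g. adding a new domain triggers a rolling deploy:
assert requires_rolling_deploy({}, {"domains": ["wiktionary.org"]})
# while tuning the pagination size does not:
assert not requires_rolling_deploy({"pagination_size": 50},
                                   {"pagination_size": 100})
```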

Beta / Staging

Ideally, the instances in the Beta Cluster would always be up-to-date with the newest development changes. Since these can include schema changes as well, the upgrade process should resemble the schema-change deployment process.

The staging environment should contain a tagged/branched version to be tested. The deployment process should follow the ones described above. This could also be the time/place for the system to record the built node module dependencies, so that we can get rid of the superfluous deploy repository (or, at least, hide it from the user); currently, we use a script to bring it up to date. Furthermore, while not strictly needed, the deployment system could record which type of deployment was chosen by the deployer, so that it can later be reused when deploying in production.

What we would explicitly need is an automated way of creating / updating the dependencies on each commit, so that the deploy repository can be retired. Concretely, each commit would be associated with a dependency artifact that is deployed together with the RESTBase source code for a given hash/tag.
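
One possible shape for such an artifact, sketched below: build node_modules once (e.g. in CI) and store a tarball keyed by the source hash, to be fetched alongside the code at deploy time. The store path is an assumption:

```python
# Sketch of a per-commit dependency artifact: node_modules is built once
# and archived keyed by the git hash, so hosts can fetch code plus matching
# dependencies without a separate deploy repository.
import subprocess

ARTIFACT_STORE = "/srv/artifacts/restbase"  # hypothetical shared store

def build_artifact(src_dir, git_hash):
    """Build node_modules for one commit and archive it keyed by hash."""
    subprocess.run(["npm", "install", "--production"], cwd=src_dir, check=True)
    tarball = f"{ARTIFACT_STORE}/node_modules-{git_hash}.tar.gz"
    subprocess.run(["tar", "czf", tarball, "-C", src_dir, "node_modules"],
                   check=True)
    return tarball  # deployed together with the source at this hash
```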

NOTE: Each environment will need to have its own configuration file.
NOTE: We already have a small staging cluster where we test changes to RESTBase and Cassandra; however, the staging environment referred to here is the new, overall staging infrastructure.


Event Timeline

mobrovac created this task. · Jun 22 2015, 1:05 PM
mobrovac raised the priority of this task to Medium.
mobrovac updated the task description.
mobrovac added a subscriber: mobrovac.
Restricted Application added a subscriber: Aklapper. · Jun 22 2015, 1:05 PM
Krenair added a subscriber: Krenair.
mobrovac updated the task description. · Jun 22 2015, 1:45 PM
mobrovac set Security to None.
mobrovac updated the task description. · Jun 22 2015, 1:55 PM
GWicke updated the task description. · Jun 22 2015, 2:13 PM
GWicke updated the task description. · Jun 22 2015, 2:20 PM
GWicke updated the task description.
GWicke updated the task description.
mobrovac updated the task description. · Jun 22 2015, 2:48 PM
mobrovac added a project: Services.
mobrovac edited subscribers, added: GWicke, demon, fgiunchedi and 5 others; removed: Aklapper.
GWicke added a comment. (Edited) · Jun 22 2015, 3:04 PM

Our current Ansible-based solution handles the most important parts of this (rolling deploys, health checks, automatic aborts).

Currently missing are:

  • rolling config change deploys
  • extended checks during deploy
  • de-pooling / re-pooling

Config changes are generally well supported in Ansible, but for that to work we'll need access to public and private hiera data (see notes). Extended checks are fairly easy to add locally, but remote checks (graphite, logstash) will need some more preparation work to expose suitable endpoints. Similarly, de-pooling / re-pooling depends on ongoing etcd / pybal work (see discussion in T100793).
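
For illustration only (this is not from the task), a remote check of the kind mentioned could poll Graphite's render API for a host's recent 5xx rate; the metric path and threshold below are invented:

```python
# Sketch of an "extended" remote check: query Graphite's render API and
# treat an elevated 5xx rate as a reason to abort the deploy.
import json
import urllib.request

GRAPHITE = "https://graphite.wikimedia.org"   # render API assumed reachable

def error_rate_ok(host, threshold=1.0):
    """Return False if the host's recent 5xx rate exceeds the threshold."""
    metric = f"restbase.{host.replace('.', '_')}.5xx.rate"  # hypothetical path
    url = f"{GRAPHITE}/render?target={metric}&from=-5min&format=json"
    series = json.load(urllib.request.urlopen(url, timeout=10))
    points = [v for v, _ in series[0]["datapoints"] if v is not None]
    return not points or max(points) < threshold
```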

GWicke updated the task description. · Jun 22 2015, 3:06 PM
mobrovac updated the task description. · Jun 22 2015, 4:45 PM
thcipriani moved this task from To Triage to In-progress on the Deployments board. · Jun 22 2015, 6:00 PM
GWicke updated the task description. · Jun 26 2015, 3:38 AM
GWicke moved this task from Backlog to Blocked / others on the RESTBase board. · Jun 29 2015, 5:23 PM
Krenair edited subscribers, added: Joe; removed: Unknown Object (User). · Aug 15 2015, 9:27 PM
akosiaris added a subscriber: akosiaris.
akosiaris added a comment.

I assume this is now waiting on scap3 migration of restbase.

Removing the Blocked-on-Operations for now.

greg added a comment. · Apr 13 2017, 11:28 PM

@mobrovac status of this one?

mobrovac closed this task as Resolved. · Apr 14 2017, 2:44 PM
mobrovac claimed this task.
mobrovac edited projects, added User-mobrovac, Services (done); removed Services.

Yup, done, resolving.