Scap3: updates, upgrades, and challenges
Closed, ResolvedPublic

Description

Scap3 has come quite a long way over the past quarter: https://phabricator.wikimedia.org/tag/scap3/

The first deploy via Scap3 happened on beta cluster last Wednesday (outcomes outlined here: https://www.mediawiki.org/wiki/Deployment_tooling/Cabal/RESTBase_Beta_deploy). The cabal et al are moving forward with more deployments and by the time the dev summit rolls around there should be a lot to discuss. We'd love to spread the good word about the work that RelEng has been doing over several quarters with interested deployers and opsen that have concerns or want to help push the project forward.

The outline below tries to expand on some of the points for discussion in the task description. Basically, the session at the Dev Summit has a few audiences:

  1. Opsen who can help move MediaWiki and other projects to a more automated deployment.
  2. Repo deployers whose repositories haven't yet (by that point) moved to Scap3.

Dev summit discussion:

  1. Explain how to move a repository from using Trebuchet to Scap3
  2. Find some other "next-step" repos—there should be a first handful of repos on Scap by the time of the dev summit
  3. How Scap3 would help prevent MediaWiki outages (see: T116593#1755029)
  4. What has to happen to get Scap3 deploying MediaWiki?

Hopeful Dev-summit Outcomes

  • Clear path forward to deploying MediaWiki with Scap3
  • Clear path forward to reducing the number of deployment tools—how does Trebuchet go away?
  • Better understanding, for interested deployers and project maintainers, of how to port a repo from Trebuchet to Scap3

Event Timeline

thcipriani raised the priority of this task from to Needs Triage.
thcipriani updated the task description.
thcipriani added subscribers: thcipriani, dduvall, greg and 2 others.

What does the migration process to Scap3 look like?

That one can probably be presented soonish since we are apparently going to use scap3 to deploy RESTBase on beta cluster soonish.

Automated beta-cluster deployment via Jenkins

I am pretty sure we can do that in October/November, depending on how well the scap3/RESTBase deployment goes.

Congratulations! This is one of the 52 proposals that made it through the first deadline of the Wikimedia-Developer-Summit-2016 selection process. Please pay attention to the next one:

> By 6 Nov 2015, all Summit proposals must have active discussions and a Summit plan documented in the description. Proposals not reaching this critical mass can continue at their own path out of the Summit.

Hi @thcipriani, this proposal focuses on a Summit session, but it gives no indication of topics that could be discussed here beforehand, and therefore it is currently missing active discussion. Note that pre-scheduled Summit sessions are expected to be preceded by online discussions and a plan to reach conclusions and next steps. It would be good to sort out these problems before the next deadline on November 6.

Scap3 was developed chiefly to replace (and improve upon) Trebuchet as a deployment tool for Services but with a general enough architecture to serve MediaWiki deployments. We've had invaluable insight on the latter from @mmodell throughout planning and implementation but, nonetheless, we feel that a focused conversation with other experienced MW engineers/opsen (e.g. @bd808, @Krenair, @csteipp, @Catrope, @GWicke) around @thcipriani's aforementioned topics would benefit all stakeholders (bingo, anyone?) of the Scap toolchain.

If current (and new) subscribers could signal their willingness/interest (or disinterest) to engage in such a conversation at the Dev Summit, we would greatly appreciate it. And we look forward to the conversation, whenever/wherever it may occur!

Notes from session:

Prompt: Talk about work with deploying MW w/ scap or migrating existing repo to use scap3?

Migrating

Similar concepts to Trebuchet

  • config lives in code repo
  • git based deployment
  • some assumptions about deployment target
  • Migration from Trebuchet presents issues currently
    • Trebuchet is salt based, flakey, different arch
    • Existing Puppet provider exists for Trebuchet
      • Need to write one for scap3

Deployments with scap3

Atomic

  • checkouts go to new directory
  • updates a single cached git directory and clones locally from that
  • should be pointed out that scap3 provides both serial and parallel deployment strategies
    • configurable for each stage (e.g. fetch in parallel, promote serially)
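
The "checkouts go to new directory" point can be illustrated with a minimal sketch. It assumes that promotion is done by swapping a `current` symlink, which these notes do not actually state; the directory layout (`revs/<revision>`, `current`) and the function names are hypothetical and are not Scap3's own code.

```python
import os
import tempfile


def checkout(deploy_root: str, revision: str) -> str:
    """Stand-in for the real git clone: create a fresh per-revision directory."""
    path = os.path.join(deploy_root, "revs", revision)
    os.makedirs(path, exist_ok=False)  # every deploy gets a new directory
    with open(os.path.join(path, "REVISION"), "w") as fh:
        fh.write(revision)
    return path


def promote(deploy_root: str, revision: str) -> str:
    """Point <deploy_root>/current at the given revision in one rename."""
    target = os.path.join(deploy_root, "revs", revision)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"revision {revision!r} has not been checked out")
    current = os.path.join(deploy_root, "current")
    # Build the new link under a temporary name, then rename it over the old
    # link. The rename is atomic on POSIX, so readers of `current` see either
    # the old revision or the new one, never a half-updated tree.
    tmp_link = current + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, current)
    return target


def current_revision(deploy_root: str) -> str:
    """Read the revision marker through the `current` link."""
    with open(os.path.join(deploy_root, "current", "REVISION")) as fh:
        return fh.read()
```

Because the previous revision's directory is left in place, going back to it in this sketch is just another `promote()` call with the old revision name.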

Checks

  • health checks are executed after each stage of deployment
    • can be commands or ops-provided Nagios checks
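
A rough sketch of how per-stage strategies and post-stage checks could fit together, under the assumption that a check is simply a command whose non-zero exit status means failure. The function names and the stage tuple layout are made up for illustration and are not Scap3's API.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_stage(action, hosts, parallel):
    """Apply action(host) to every host, concurrently or one at a time."""
    if parallel:
        with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
            return list(pool.map(action, hosts))
    return [action(host) for host in hosts]


def checks_pass(commands):
    """Run each check command; a non-zero exit status counts as a failure."""
    return all(subprocess.run(cmd).returncode == 0 for cmd in commands)


def deploy(stages, hosts):
    """Run stages in order, stopping at the first stage whose checks fail.

    Each stage is a tuple: (name, action, parallel, check_commands).
    Returns the names of the stages that ran and passed their checks.
    """
    completed = []
    for name, action, parallel, checks in stages:
        run_stage(action, hosts, parallel)
        if not checks_pass(checks):
            break  # later stages never run
        completed.append(name)
    return completed
```

A stage list such as `[("fetch", fetch, True, [...]), ("promote", promote, False, [...])]` would correspond to the "fetch in parallel, promote serially" example from the notes, with `fetch` and `promote` being whatever per-host callables the deployer supplies.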

Config deployment

  • w/ Trebuchet, code was deployed but not config
  • keep templates in your repo
  • can execute config deployment independently
  • Jinja2 used for templating, and can reference (sensitive) variables that are supplied by ops/puppet
  • question: have you considered using Puppet for config deploy/templating?
    • considered but seemed too big of a dependency
    • comment: sounds like we're building a config management system
      • sort of, a small piece is essentially config mgt
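
As an illustration of the templating idea only, here is a small Jinja2 sketch. The template contents, the variable names, and the rule that ops-supplied values override repo defaults are all assumptions made for this example; the notes do not describe how Scap3 merges variables.

```python
from jinja2 import Environment, StrictUndefined

# Hypothetical template, as it might be kept in the service's own repo.
TEMPLATE = """\
port: {{ port }}
db_host: {{ db_host }}
db_password: {{ db_password }}
"""


def render_config(template_text, repo_vars, ops_vars):
    """Render a config file from repo defaults plus ops-supplied values.

    Assumption for this sketch: ops-supplied values win on conflict.
    """
    # StrictUndefined makes a missing variable an error at render time
    # instead of silently producing an empty string in the config.
    env = Environment(undefined=StrictUndefined, keep_trailing_newline=True)
    variables = {**repo_vars, **ops_vars}
    return env.from_string(template_text).render(**variables)
```

Calling `render_config(TEMPLATE, {"port": 7231, "db_host": "localhost"}, {"db_password": "..."})` fills in the sensitive value without it ever being committed alongside the template.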

question: does it support fanout?

  • that's part of MW implementation

question: support for (de)pooling?

  • in the works, support for mocking in Beta Cluster
  • comments regarding pooling/depooling vs proxy/queues for requests until services are restarted

Canary deploys

  • you can define canary/deployment groups
    • have tiered rollout of groups
    • bail on the first failure
    • rollback
    • implemented as general deployment "groups"
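
The tiered rollout described above might look roughly like the following. This is a sketch with hypothetical names, and it assumes that a failure rolls back every host touched so far; the notes only say "bail on the first failure" and "rollback" without giving the scope.

```python
def rollout(groups, deploy, rollback):
    """Deploy to each group in order; on the first failure, roll back and stop.

    groups   -- ordered list of (group_name, [hosts]), canaries first
    deploy   -- callable(host) -> bool, True on success
    rollback -- callable(host), restores the previous revision on that host
    Returns (succeeded, name_of_failed_group_or_None).
    """
    touched = []
    for name, hosts in groups:
        for host in hosts:
            touched.append(host)
            if not deploy(host):
                # Assumed policy: undo every host touched so far, newest first.
                for done in reversed(touched):
                    rollback(done)
                return False, name
    return True, None
```

With groups ordered as `[("canary", [...]), ("tier1", [...]), ...]`, a failure on a canary host stops the rollout before any later group is touched.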

RelEng is around to help teams migrate. Ask Greg G or anyone from RelEng. :)

One of the things Trebuchet had was a store for the current state of a deployment

  • How will scap3 provide this for newly provisioned nodes?
    • Being worked on but we need to work out the kinks in the provider

Trebuchet failed because it couldn't deploy MW

  • it would get into a bad state
  • pull and bad state is going to be a problem with any system
  • current idea for scap3 fanout is going to be pull based

We've been thinking about different ideas for transport of repo

  • bittorrent for example
  • we've been pretty agnostic about our transport implementation
    • if we decide our current approach is wrong, we should be able to implement something different
  • Biggest problem historically has been localization
  • Have you looked at ?-db (rocksDB?) to replace cdb?
  • We've gone the direction of git-annex
    • that exists in Trebuchet using git-fat
  • We're looking into ways to make cdb unnecessary for l10n cache using straight PHP/HHVM: https://phabricator.wikimedia.org/T99740
    • Authoritative mode, not a prereq

BT transport

  • only one implementation supported (bittornado)
  • you don't know when you're done seeding
  • could run on the system persistently
  • we've only played around with it; looks promising but not the only way to go
  • doesn't include dot files by default. might have to patch
  • you might have to tar/untar compress/decompress

We haven't gotten into MW deploys too much

  • We mainly targeted replacing Trebuchet
  • offline --

Timo: Strategy needed for failed fetches/checkouts, pull vs. fetch

@thcipriani: do you want to claim and close this one? I think we can call this a success.