Refactor our helmfile.d dir structure for services
Closed, Resolved (Public)

Description

Right now, our structure has several flaws, namely:

  • extreme repetition
  • no centralized place to define common values (say, the logstash hostname, or the version of envoy we want to install) which leads to horrors such as https://gerrit.wikimedia.org/r/q/topic:%22envoy_1.14.4%22+(status:open%20OR%20status:merged) whenever SREs need to change something across the board
  • Mechanics of releasing are cumbersome (you have to change directory, source a file, run helmfile sync for *every* cluster you want to release to)
  • Releasing has several nice footguns, including the fact that if you forget to source the correct env variables you risk applying the values of one cluster to another one without noticing

Our goal with a refactor is to:

  • Remove the ambiguity between the cluster we're working on and what values we're applying: it should not be possible to apply codfw values to eqiad and vice-versa by mistake.
  • Reduce repetition to the bare minimum that makes sense
  • Allow easier management of SRE-driven changes/deployments
  • Simplify the release process for users.
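To make the first and last goals concrete, here is a minimal sketch of what a release could look like once the target cluster is selected only through helmfile's `-e` flag. It assumes a single `<service>/helmfile.yaml` per service; the function name `release_commands` is hypothetical, and the sketch prints the commands rather than running them.

```shell
#!/bin/sh
# Print (not run) the release commands for one service across environments.
# Assumption: the refactored layout has one <service>/helmfile.yaml and the
# cluster is chosen only via helmfile's -e/--environment flag.
release_commands() {
    service="$1"
    shift
    for env in "$@"; do
        # The environment is part of the command itself, so no sourced
        # shell state can silently point the release at another cluster.
        printf 'helmfile -f %s/helmfile.yaml -e %s sync\n' "$service" "$env"
    done
}

release_commands eventgate-analytics staging codfw eqiad
```

Because nothing is sourced and no directory change is needed, forgetting a step can no longer apply one cluster's values to another.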

Details

Repo                          Branch      Lines +/-
operations/puppet             production  +18 -38
operations/puppet             production  +138 -146
operations/deployment-charts  master      +16 -176
operations/deployment-charts  master      +70 -792
operations/deployment-charts  master      +2 -0
operations/deployment-charts  master      +73 -450
operations/deployment-charts  master      +85 -471
operations/deployment-charts  master      +1 -1
operations/deployment-charts  master      +343 -962
operations/deployment-charts  master      +51 -224
operations/deployment-charts  master      +242 -592
operations/deployment-charts  master      +60 -177
operations/deployment-charts  master      +51 -224
operations/deployment-charts  master      +162 -382
operations/deployment-charts  master      +60 -247
operations/deployment-charts  master      +66 -453
operations/deployment-charts  master      +53 -150
operations/deployment-charts  master      +53 -114
operations/deployment-charts  master      +6 -5
operations/deployment-charts  master      +55 -158
operations/deployment-charts  master      +60 -213
operations/deployment-charts  master      +12 -0
operations/deployment-charts  master      +160 -1
operations/deployment-charts  master      +77 -51
operations/deployment-charts  master      +137 -34
operations/deployment-charts  master      +89 -89
operations/puppet             production  +75 -34
operations/deployment-charts  master      +55 -162
operations/debs/helm          master      +57 -0
operations/debs/helm-diff     master      +7 -1
operations/debs/helm-diff     master      +44 -0
operations/debs/helmfile      master      +193 -0
operations/debs/helm-diff     master      +217 -0
integration/config            master      +8 -4

Event Timeline

Change 618501 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/helm-diff@master] Add postinst to clean up after old package version

https://gerrit.wikimedia.org/r/618501

Change 618501 merged by jenkins-bot:
[operations/debs/helm-diff@master] Add postinst to clean up after old package version

https://gerrit.wikimedia.org/r/618501

Change 618524 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/helm-diff@master] Fix if in postinst

https://gerrit.wikimedia.org/r/618524

Change 618524 merged by jenkins-bot:
[operations/debs/helm-diff@master] Fix if in postinst

https://gerrit.wikimedia.org/r/618524

Updated helm, helmfile and helm-diff to their latest versions on deploy* and contint*

Change 618556 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/debs/helm@master] Detect kubeconfig as known argument in plugin invocations

https://gerrit.wikimedia.org/r/618556

Change 618556 merged by jenkins-bot:
[operations/debs/helm@master] Detect kubeconfig as known argument in plugin invocations

https://gerrit.wikimedia.org/r/618556

Change 615498 merged by jenkins-bot:
[operations/deployment-charts@master] helmfile: refactoring blubberoid

https://gerrit.wikimedia.org/r/615498

Change 619448 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] deployment_server::helmfile: generate files for new-style helmfile organization

https://gerrit.wikimedia.org/r/619448

Change 619448 merged by Giuseppe Lavagetto:
[operations/puppet@production] deployment_server::helmfile: generate files for new-style helmfile organization

https://gerrit.wikimedia.org/r/619448

FTR: I'm going to do eventgate-analytics (instead of mathoid), as it's probably good to do something more complicated next to uncover potential shortcomings.

Change 619437 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] helmfile.d: refactor eventgate-main

https://gerrit.wikimedia.org/r/619437

Change 620934 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Test deployments with helmfile lint

https://gerrit.wikimedia.org/r/620934

Change 620935 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Switch all charts from "stable" to "wmf-stable"

https://gerrit.wikimedia.org/r/620935

Change 621286 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] helmfile: refactor eventgate-analytics

https://gerrit.wikimedia.org/r/621286

Here is more or less what I did. It may help others produce CRs that are as easy to review as we can reasonably make them:

SERVICE=eventgate-analytics

git checkout master
cd helmfile.d/services

# generate pre migrate templates
mkdir -p /tmp/helmfile.d/${SERVICE}/
for ENV in staging codfw eqiad; do
    helmfile -f ${ENV}/${SERVICE}/helmfile.yaml template > /tmp/helmfile.d/${SERVICE}/${ENV}.yaml
done

git checkout -b helmfile_refactor_${SERVICE}
mkdir $SERVICE
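# ENV is still "eqiad" here (left over from the loop above)
git mv ${ENV}/${SERVICE}/helmfile.yaml  ${SERVICE}/helmfile.yaml

# Update helmfile.yaml with the new-style version
# chart: ... in helmfile.yaml might be wrong now (in case it's not the same as the service name)
# (no -i here: sed -i would edit blubberoid/helmfile.yaml in place and write nothing to stdout)
sed "s/blubberoid/${SERVICE}/g" blubberoid/helmfile.yaml > ${SERVICE}/helmfile.yaml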

# Check for differences
md5sum {codfw,eqiad,staging}/${SERVICE}/values*
# Lots of unpredictable manual editing, moving and deleting
git mv ...
git rm ... ...
$EDITOR ... ...

# diff pre and post migration templates (should be equal apart from renaming the release "production" in environment "staging" to "staging".)
for ENV in staging codfw eqiad; do
    helmfile -f ${SERVICE}/helmfile.yaml -e ${ENV} template | colordiff -Nur /tmp/helmfile.d/${SERVICE}/${ENV}.yaml - | less
done
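The final diff step above is interactive. As a rough, non-interactive complement, the sketch below only reports which environments render differently before and after the migration. The function name `changed_envs` and the two directory arguments are hypothetical; it assumes each directory holds one `<env>.yaml` produced by `helmfile ... template`.

```shell
#!/bin/sh
# Report which environments render differently before and after a migration.
# before_dir and after_dir are hypothetical directories holding one
# <env>.yaml per environment (output of "helmfile ... template").
changed_envs() {
    before_dir="$1"
    after_dir="$2"
    shift 2
    for env in "$@"; do
        # cmp -s is silent and exits non-zero when the files differ or when
        # one of them is missing, so both cases are reported.
        if ! cmp -s "$before_dir/$env.yaml" "$after_dir/$env.yaml"; then
            echo "$env"
        fi
    done
}

# Example call (paths are placeholders):
# changed_envs /tmp/helmfile.d/pre /tmp/helmfile.d/post staging codfw eqiad
```

Given the release rename mentioned above, staging is expected to show up in the output; codfw or eqiad appearing would be the signal to look at the full diff.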

Current state is that we have 3 environments (staging, codfw, eqiad) where we usually deploy 1 to N releases each:

  • production
  • canary (optional)
  • test (optional)

This leads to the staging environment having a release called "production" deployed as well, which is a poor choice naming-wise.

We propose to change the names for releases as follows:

  • "main" (former "production"), present in all environments
  • "canary" (optional), for the canary part of the release, may be present in all environments
  • "test" (optional), for test releases that would also include separate service definitions, may only be used in staging environment

The downside of renaming the release for production environment is that we need to deal with a bit of downtime, so it would be nice to have T260663 available.
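The naming rules proposed above can be expressed as a small check. This is only a sketch: the function name `release_allowed` is hypothetical, and rejecting every other name (including the old "production") is an assumption that a gradual migration would need to relax.

```shell
#!/bin/sh
# Check a release name against the proposed naming convention.
# Returns 0 if the release may exist in the given environment, 1 otherwise.
release_allowed() {
    env="$1"
    release="$2"
    case "$release" in
        main|canary)
            # "main" is present everywhere; "canary" is optional everywhere.
            return 0 ;;
        test)
            # "test" releases are restricted to the staging environment.
            [ "$env" = "staging" ] ;;
        *)
            # Assumption: any other name, including "production", is rejected.
            return 1 ;;
    esac
}

release_allowed staging test && echo "staging/test is allowed"
```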

+1 to the proposal.

> The downside of renaming the release for production environment is that we need to deal with a bit of downtime

Maybe there isn't a huge need to have downtime for this, but instead we can try to adopt this convention for new charts, and for those willing to change (like me with eventgate + eventstreams)?

If we want to adopt it, it should be documented in the helmfile.d README I think, and probably on wikitech as well. +1 from me.

Change 620935 merged by Giuseppe Lavagetto:
[operations/deployment-charts@master] Switch all charts from "stable" to "wmf-stable"

https://gerrit.wikimedia.org/r/620935

Change 620934 merged by jenkins-bot:
[operations/deployment-charts@master] Test deployments with helmfile lint

https://gerrit.wikimedia.org/r/620934

Change 622582 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Refresh the documentation of the helmfile.d/services

https://gerrit.wikimedia.org/r/622582

Change 622583 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Add an helper script for the conversion to the new layout.

https://gerrit.wikimedia.org/r/622583

Change 622584 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Convert termbox to the new layout using the convert script

https://gerrit.wikimedia.org/r/622584

Change 622585 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Convert citoid to new layout using the conversion script

https://gerrit.wikimedia.org/r/622585

Hi, I'm working on https://phabricator.wikimedia.org/T255835, but I've only added the ability to specify image updates for an environment in the .pipeline/config.yaml. I'm wondering if I should also add this ability for releases as well...Is it likely we'll want to have an image version override in the release values.yaml?

> Hi, I'm working on https://phabricator.wikimedia.org/T255835, but I've only added the ability to specify image updates for an environment in the .pipeline/config.yaml. I'm wondering if I should also add this ability for releases as well...Is it likely we'll want to have an image version override in the release values.yaml?

Currently there's only a handful of charts using multiple releases per environment. So I would say you can start with what you have and add multi-release support later on, assuming that it is still possible to do the image update manually, of course.

> Hi, I'm working on https://phabricator.wikimedia.org/T255835, but I've only added the ability to specify image updates for an environment in the .pipeline/config.yaml. I'm wondering if I should also add this ability for releases as well...Is it likely we'll want to have an image version override in the release values.yaml?

In the current proposed hierarchy, environments come before the releases, so a deployer will still be able to modify what goes to a specific release manually. I think it's pretty ok to assume you can avoid implementing it now.

Change 622806 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Convert mathoid to the new layout using the convert script

https://gerrit.wikimedia.org/r/622806

Change 622823 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] helmfile_convert_diff.sh copy old helmfiles from parent on review

https://gerrit.wikimedia.org/r/622823

Change 622582 merged by Giuseppe Lavagetto:
[operations/deployment-charts@master] Refresh the documentation of the helmfile.d/services

https://gerrit.wikimedia.org/r/622582

Change 622583 merged by Giuseppe Lavagetto:
[operations/deployment-charts@master] Add an helper script for the conversion to the new layout.

https://gerrit.wikimedia.org/r/622583

Change 622823 merged by jenkins-bot:
[operations/deployment-charts@master] helmfile_convert_diff.sh copy old helmfiles from parent on review

https://gerrit.wikimedia.org/r/622823

Change 622827 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Align all exsiting new-style helmfiles to example

https://gerrit.wikimedia.org/r/622827

Change 622584 merged by jenkins-bot:
[operations/deployment-charts@master] Convert termbox to the new layout using the convert script

https://gerrit.wikimedia.org/r/622584

Change 622585 merged by Giuseppe Lavagetto:
[operations/deployment-charts@master] Convert citoid to new layout using the conversion script

https://gerrit.wikimedia.org/r/622585

Change 623739 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Convert mobileapps to the new layout

https://gerrit.wikimedia.org/r/623739

Change 622827 merged by jenkins-bot:
[operations/deployment-charts@master] Align all exsiting new-style helmfiles to example

https://gerrit.wikimedia.org/r/622827

Change 623749 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Convert cxserver to the new layout

https://gerrit.wikimedia.org/r/623749

Change 623739 merged by jenkins-bot:
[operations/deployment-charts@master] Convert mobileapps to the new layout

https://gerrit.wikimedia.org/r/623739

Change 622806 merged by jenkins-bot:
[operations/deployment-charts@master] Convert mathoid to the new layout using the convert script

https://gerrit.wikimedia.org/r/622806

Change 624007 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Convert echotore to the new layout

https://gerrit.wikimedia.org/r/624007

Change 624008 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Convert proton to the new layout

https://gerrit.wikimedia.org/r/624008

Change 624010 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Convert zotero to the new layout

https://gerrit.wikimedia.org/r/624010

Change 624017 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Convert sessionstore to the new layout

https://gerrit.wikimedia.org/r/624017

Change 621286 merged by jenkins-bot:
[operations/deployment-charts@master] helmfile: refactor eventgate-analytics

https://gerrit.wikimedia.org/r/621286

Change 623749 merged by jenkins-bot:
[operations/deployment-charts@master] Convert cxserver to the new layout

https://gerrit.wikimedia.org/r/623749

Change 624007 merged by jenkins-bot:
[operations/deployment-charts@master] Convert echostore to the new layout

https://gerrit.wikimedia.org/r/624007

Change 624008 merged by jenkins-bot:
[operations/deployment-charts@master] Convert proton to the new layout

https://gerrit.wikimedia.org/r/624008

Change 624010 merged by jenkins-bot:
[operations/deployment-charts@master] Convert zotero to the new layout

https://gerrit.wikimedia.org/r/624010

Change 624017 merged by jenkins-bot:
[operations/deployment-charts@master] Convert sessionstore to the new layout

https://gerrit.wikimedia.org/r/624017

Change 624854 had a related patch set uploaded (by JMeybohm; owner: JMeybohm):
[operations/deployment-charts@master] Revert "Revert "Convert proton to the new layout""

https://gerrit.wikimedia.org/r/624854

Change 624854 merged by jenkins-bot:
[operations/deployment-charts@master] Revert "Revert "Convert proton to the new layout""

https://gerrit.wikimedia.org/r/624854

I'm planning to deploy tomorrow, so I was wondering if I can have clarification on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/622585

The eqiad file was renamed to just values, but I noticed in the examples folder there's both values.yaml and values-eqiad.yaml

So when I do a deploy, will I be able to deploy to eqiad, or is this a mistake and there should also be a separate eqiad file in this change in order to deploy to eqiad?

> I'm planning to deploy tomorrow, so I was wondering if I can have clarification on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/622585
>
> The eqiad file was renamed to just values, but I noticed in the examples folder there's both values.yaml and values-eqiad.yaml
>
> So when I do a deploy, will I be able to deploy to eqiad, or is this a mistake and there should also be a separate eqiad file in this change in order to deploy to eqiad?

You're totally able to deploy to eqiad without a values-eqiad.yaml. Those environment-specific values files can be used to override settings (for that environment) but are not required. Please see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/README.MD for a description of which files are applied in which order (please tell us if anything is unclear or missing! :))
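The "optional override" behaviour described in this reply can be sketched as follows. The function name `values_files` is hypothetical, and only the two files named in this thread are considered; the README linked above remains the authoritative description of the full order.

```shell
#!/bin/sh
# List the values files that would apply to one environment, lowest
# precedence first. Only values.yaml and values-<env>.yaml are considered
# here; the real layout may apply more files than these two.
values_files() {
    service_dir="$1"
    env="$2"
    for f in "$service_dir/values.yaml" "$service_dir/values-$env.yaml"; do
        # Environment-specific files are optional, so skip missing ones.
        if [ -f "$f" ]; then
            echo "$f"
        fi
    done
}

# Example call (path is a placeholder):
# values_files helmfile.d/services/citoid eqiad
```

With only a values.yaml present, the call for eqiad prints that single file, which matches the answer: the deploy works and simply has no eqiad-specific overrides.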

Great, thanks!

Change 619437 merged by jenkins-bot:
[operations/deployment-charts@master] helmfile.d: refactor eventgate-main

https://gerrit.wikimedia.org/r/619437

Change 628082 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-main - use proper path to private values file in helmfile.yaml

https://gerrit.wikimedia.org/r/628082

Change 628082 merged by Ottomata:
[operations/deployment-charts@master] eventgate-main - use proper path to private values file in helmfile.yaml

https://gerrit.wikimedia.org/r/628082

Hey, I just noticed that eventgate-main staging doesn't have a canary release anymore. This is fine most of the time, but what if I want to make changes to the canary release helmfile setup? I'd like to have a place to test those changes that is not production. A canary release in staging doesn't help with ensuring that the app deploys fine; it helps to make sure that the helmfile setup is good and the same as production, before actually applying in production.

I'm ok with not having a canary in staging for now, as I'm not doing any active development on the helm bits. Can I add it back in when I do?

@Ottomata do you have the chance to migrate eventgate-analytics-external, eventgate-logging-external and eventstreams in the near future? As they still use the old structure, they would require extra care for the latest envoy update (T264157).

Change 634354 had a related patch set uploaded (by Jeena Huneidi; owner: Jeena Huneidi):
[operations/deployment-charts@master] [DNM] Experimental King helmfile

https://gerrit.wikimedia.org/r/634354

I think there are just a few dangling services that are managed by the analytics team, specifically:

  • eventgate-analytics-external
  • eventgate-logging-external
  • eventstreams

I think we've waited long enough to remove the support for the old structure in all of our scripts and in puppet. I will wait another week, but next Monday (October 26th) I'll just remove support for the old structure.

@Ottomata @elukey I guess you want to take care of the conversion before your services become undeployable with the old layout.

AH! Sorry, I was on vacation for 2 weeks and have really been trying to focus on a lagging goal from last quarter. Ok, I'll prioritize this instead.

Change 634976 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] helmfile.d: refactor eventgate-logging-external

https://gerrit.wikimedia.org/r/634976

Change 634983 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] helmfile.d: refactor eventgate-analytics-external

https://gerrit.wikimedia.org/r/634983

Change 634984 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] helmfile.d: refactor eventstreams

https://gerrit.wikimedia.org/r/634984

Change 634976 merged by Ottomata:
[operations/deployment-charts@master] helmfile.d: refactor eventgate-logging-external

https://gerrit.wikimedia.org/r/634976

Change 634983 merged by Ottomata:
[operations/deployment-charts@master] helmfile.d: refactor eventgate-analytics-external

https://gerrit.wikimedia.org/r/634983

Change 635062 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-analytics-external - set staging replicas to 1

https://gerrit.wikimedia.org/r/635062

Change 635062 merged by Ottomata:
[operations/deployment-charts@master] eventgate-analytics-external - set staging replicas to 1

https://gerrit.wikimedia.org/r/635062

Change 634984 merged by Ottomata:
[operations/deployment-charts@master] helmfile.d: refactor eventstreams

https://gerrit.wikimedia.org/r/634984

Ok! Done.

Thanks a ton, I'm going to remove all the special casing both in puppet and deployment-charts now.

Change 638415 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/deployment-charts@master] Remove old-style helmfile structure from CI

https://gerrit.wikimedia.org/r/638415

Change 638415 merged by jenkins-bot:
[operations/deployment-charts@master] Remove old-style helmfile structure from CI

https://gerrit.wikimedia.org/r/638415

Change 638438 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::kubernetes::deployment_server: cleanup the old layout

https://gerrit.wikimedia.org/r/638438

Change 638439 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] profile::kubernetes::deployment_server::helmfile: simplify code

https://gerrit.wikimedia.org/r/638439

Change 638438 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::kubernetes::deployment_server: cleanup the old layout

https://gerrit.wikimedia.org/r/638438

Change 638439 merged by Giuseppe Lavagetto:
[operations/puppet@production] profile::kubernetes::deployment_server::helmfile: simplify code

https://gerrit.wikimedia.org/r/638439