
Create production and canary releases for existing eventgate helmfile services
Open, High, Public

Description

The eventgate chart has been refactored to support canary releases. Along the way, we stopped using the release name to distinguish the app deployments of eventgate and now use main_app.name instead. We should return to the convention of naming an app's production release 'production', and also add a 'canary' release with a single replica.

  • eventgate-logging-external is not yet in use, so it can be switched without any client disruption.

eventgate-analytics and eventgate-main are in use. For those we'll need to deploy the new production and canary releases alongside the existing 'analytics' and 'main' releases. To avoid port conflicts, we'll have to pick new nodePorts for the 'production' release's Service. Once the new releases are deployed, we'll have to switch LVS to use the new ports.

eventgate-main will change ports from http 32192 and https 4292 to http 34192 and https 4492.
eventgate-analytics will change ports from http 31192 and https 4192 to http 35192 and https 4592.

  • Deploy canary and production releases for eventgate-main in all clusters on nodePorts 34192 and 4492
  • Deploy canary and production releases for eventgate-analytics in all clusters on nodePorts 35192 and 4592
  • Create new LVS services for eventgate-main on ports 34192 and 4492 and eventgate-analytics on ports 35192 and 4592 - https://gerrit.wikimedia.org/r/c/operations/puppet/+/572960
  • Make sure Analytics VLAN can reach eventgate-analytics on ports 35192 and 4592
  • Point clients to new LVS ports:
  • Switch the old LVS services to non-paging
  • Remove old LVS services on old ports
  • Destroy eventgate-main 'main' and eventgate-analytics 'analytics' Helm releases. Setting installed: false would do the trick.
  • Remove the eventgate-main 'main' and eventgate-analytics 'analytics' Helm release configs from the services' helmfile.yaml files.
  • Remove Analytics VLAN firewall rules for 31192
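The port assignments in the checklist above can be captured in a small lookup and checked against `kubectl get services` output after each deploy. This is only a minimal sketch: the function names are hypothetical, and it assumes the Services are named `<service>-production...`, following the naming seen for eventgate-logging-external.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not existing tooling.

# Print "<http nodePort> <https nodePort>" for a service's 'production' release,
# using the new ports listed in the task description.
new_node_ports() {
    case "$1" in
        eventgate-main)      echo "34192 4492" ;;
        eventgate-analytics) echo "35192 4592" ;;
        *) echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Read `kubectl get services` output on stdin. Succeed only if both expected
# nodePorts appear on lines for the service's production release.
# Assumes Service names start with "<service>-production".
has_new_node_ports() {
    service=$1
    ports=$(new_node_ports "$service") || return 1
    listing=$(cat)
    for port in $ports; do
        # kubectl shows a NodePort mapping as "<port>:<nodePort>/TCP"
        printf '%s\n' "$listing" \
            | grep -q "^${service}-production.*:${port}/TCP" || return 1
    done
}
```

A possible use, run from a host with access to the right cluster and namespace, would be `kubectl get services | has_new_node_ports eventgate-main && echo ok`.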

Event Timeline

Ottomata created this task. Thu, Feb 13, 8:57 PM
Restricted Application added a subscriber: Aklapper. Thu, Feb 13, 8:57 PM

Change 572095 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Configure production and canary release for eventgate-logging-external

https://gerrit.wikimedia.org/r/572095

Change 572095 merged by Ottomata:
[operations/deployment-charts@master] Configure production and canary release for eventgate-logging-external

https://gerrit.wikimedia.org/r/572095

Mentioned in SAL (#wikimedia-operations) [2020-02-13T21:35:57Z] <ottomata> deploying production and canary releases for eventgate-logging-external (and destroying the 'logging-external' release) (safe because eventgate-logging-external is not in use) - T245203

Change 572097 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] eventgate-logging-external - Remove now unused 'logging-external' release

https://gerrit.wikimedia.org/r/572097

Change 572097 merged by Ottomata:
[operations/deployment-charts@master] eventgate-logging-external - Remove now unused 'logging-external' release

https://gerrit.wikimedia.org/r/572097

Just applied this for eventgate-logging-external:

cd /srv/deployment-charts/helmfile.d/services/staging/eventgate-logging-external

# Examine helmfile diff for production and canary.  The canary release should not deploy a Service.
source .hfenv; helmfile --selector name=production diff
source .hfenv; helmfile --selector name=canary diff

# Destroy logging-external release.  We can do this now for eventgate-logging-external because it is not in use.
source .hfenv; helmfile --selector name=logging-external destroy

# Install production and canary releases
source .hfenv; helmfile --selector name=production apply
source .hfenv; helmfile --selector name=canary apply

# Check that there is only a Service for the production release
kubectl get services
NAME                                                TYPE       CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
eventgate-logging-external-production               NodePort   10.64.76.251   <none>        8192:33192/TCP   80s
eventgate-logging-external-production-tls-service   NodePort   10.64.76.215   <none>        4392:4392/TCP    80s

# There should be one canary pod and N production pods:
kubectl get pods
NAME                                                    READY   STATUS    RESTARTS   AGE
eventgate-logging-external-canary-b567bdd54-8x4jr       3/3     Running   0          72s
eventgate-logging-external-production-7c54bffd8-vsqkl   3/3     Running   0          112s
...

Once I had verified that this worked in staging, I scheduled downtime for the Icinga alerts for eventgate-logging-external and then repeated the process for codfw and eqiad.
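For reference, the per-cluster steps above could be written out as a dry-run plan before running anything. The sketch below only prints commands. The `plan_release_switch` helper is hypothetical, and it assumes the codfw and eqiad directories sit next to staging/ under the same helmfile.d/services path.

```shell
#!/bin/sh
# Sketch only: prints (does not run) the helmfile commands for each cluster.
# The codfw/eqiad directory layout is assumed to mirror staging/.

SERVICES_DIR=/srv/deployment-charts/helmfile.d/services

# Print the commands for one cluster and one service.
# $3 is the old release name to destroy (e.g. logging-external).
plan_release_switch() {
    cluster=$1
    service=$2
    old_release=$3
    echo "cd ${SERVICES_DIR}/${cluster}/${service}"
    for release in production canary; do
        echo "source .hfenv; helmfile --selector name=${release} diff"
    done
    echo "source .hfenv; helmfile --selector name=${old_release} destroy"
    for release in production canary; do
        echo "source .hfenv; helmfile --selector name=${release} apply"
    done
}

# Staging first, then the two production clusters.
for cluster in staging codfw eqiad; do
    plan_release_switch "$cluster" eventgate-logging-external logging-external
done
```

Destroying the old release before applying the new ones is only safe here because eventgate-logging-external is not in use. For eventgate-analytics and eventgate-main the old releases have to stay up until clients and LVS have moved to the new ports.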

I then merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/572097 to remove the now unused 'logging-external' release.

Will do eventgate-analytics and eventgate-main next week with new nodePorts.

Change 572100 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Configure production and canary release for eventgate-analytics

https://gerrit.wikimedia.org/r/572100

Change 572106 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Configure production and canary release for eventgate-main

https://gerrit.wikimedia.org/r/572106

eventgate-main will change ports from http 32192 and https 4292 to http 34192 and https 4492.
eventgate-analytics will change ports from http 31192 and https 4192 to http 35192 and https 4592.

fdans moved this task from Incoming to Event Platform on the Analytics board. Mon, Feb 17, 4:41 PM

Change 572100 merged by Ottomata:
[operations/deployment-charts@master] Configure production and canary release for eventgate-analytics

https://gerrit.wikimedia.org/r/572100

Change 572106 merged by Ottomata:
[operations/deployment-charts@master] Configure production and canary release for eventgate-main

https://gerrit.wikimedia.org/r/572106

Ottomata updated the task description. Tue, Feb 18, 5:00 PM
Ottomata updated the task description. Tue, Feb 18, 8:05 PM
Ottomata updated the task description. Tue, Feb 18, 8:10 PM

Change 572960 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/puppet@production] Add new LVS services for new eventgate-main and eventgate-analytics ports

https://gerrit.wikimedia.org/r/572960

Ottomata triaged this task as High priority. Tue, Feb 18, 8:11 PM
Ottomata updated the task description.
Ottomata updated the task description. Tue, Feb 18, 8:17 PM

Change 572960 merged by Alexandros Kosiaris:
[operations/puppet@production] Add new LVS services for new eventgate-main and eventgate-analytics ports

https://gerrit.wikimedia.org/r/572960

akosiaris updated the task description. Wed, Feb 19, 2:46 PM

Change 573307 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/mediawiki-config@master] Use new LVS port for eventgate-analytics

https://gerrit.wikimedia.org/r/573307

Change 573309 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[analytics/refinery@master] Change default swift event_service_url to new eventgate-analytics port

https://gerrit.wikimedia.org/r/573309

Ottomata updated the task description. Wed, Feb 19, 3:35 PM
Ottomata updated the task description. Wed, Feb 19, 3:46 PM

Change 562842 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/homer/public@master] Add ports and codfw LVS IP to term eventgate-analytics in analytics-in4

https://gerrit.wikimedia.org/r/562842

Change 562842 merged by Elukey:
[operations/homer/public@master] Add ports and codfw LVS IP to term eventgate-analytics in analytics-in4

https://gerrit.wikimedia.org/r/562842

Change 573317 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/puppet@production] [wdqs] use https and 4592 for eventgate-analytics endpoint

https://gerrit.wikimedia.org/r/573317

Mentioned in SAL (#wikimedia-operations) [2020-02-19T16:05:20Z] <elukey> Update analytics-in4 filter term eventgate for T245203 on cr1/cr2 eqiad

Ottomata updated the task description. Wed, Feb 19, 4:06 PM

Change 573309 merged by Ottomata:
[analytics/refinery@master] Change default swift event_service_url to new eventgate-analytics port

https://gerrit.wikimedia.org/r/573309

Change 573317 merged by Ottomata:
[operations/puppet@production] [wdqs] use https and 4592 for eventgate-analytics endpoint

https://gerrit.wikimedia.org/r/573317

Mentioned in SAL (#wikimedia-operations) [2020-02-19T16:24:55Z] <otto@deploy1001> Started deploy [analytics/refinery@e23918a]: Updating eventgate-analytics port (T245203) and also eventlogging whitelist

Mentioned in SAL (#wikimedia-operations) [2020-02-19T16:37:22Z] <otto@deploy1001> Finished deploy [analytics/refinery@e23918a]: Updating eventgate-analytics port (T245203) and also eventlogging whitelist (duration: 12m 27s)

Ottomata updated the task description. Wed, Feb 19, 8:35 PM

Re-deployed our glent esbulk oozie job against refinery version 2020-02-19T16.58.16+00.00--scap_sync_2020-02-19_0001. Additionally shipped an update to our airflow scheduler that changes the eventgate port used there as well.

Ottomata updated the task description. Thu, Feb 20, 2:35 PM
Ottomata updated the task description.