eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy
Closed, Duplicate (Public)

Event Timeline

Change 722654 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/deployment-charts@master] Eventgate: Symlink _helpers and _tls_helpers

https://gerrit.wikimedia.org/r/722654

Change 722935 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[operations/deployment-charts@master] Update eventgate helmfile.d for eventgate 0.5 chart

https://gerrit.wikimedia.org/r/722935

Change 722654 merged by Ottomata:

[operations/deployment-charts@master] Eventgate: Symlink _helpers and _tls_helpers

https://gerrit.wikimedia.org/r/722654

Change 722935 merged by Ottomata:

[operations/deployment-charts@master] Update eventgate helmfile.d for eventgate 0.5 chart

https://gerrit.wikimedia.org/r/722935

We tried to deploy this today but ran into an issue: since the k8s resources have been renamed, k8s treats e.g. the Service as a new resource, but the old Service still exists on the same port, causing a port conflict.

To deploy, we are going to have to depool a DC, delete the existing deployment, apply the new one, then repool.

I'd like to talk with someone about https://phabricator.wikimedia.org/T282148#7373078 before we do, to make sure we don't have to do that kind of failover deployment more than once.

To deploy, we are going to have to depool a DC, delete the existing deployment, apply the new one, then repool.

Since it's such a hassle, should we do the same thing for the eventstreams chart first?

Oof, right. I've already merged the eventgate chart change, and I think to roll back we'd have to revert and then bump the chart version to 0.6.0.

Grr, I guess we should roll back, right?

Why roll back? We can just make the same changes to eventstreams before going through the deployment.

I'm worried that in the meantime someone will need to make an emergency fix/change to eventgate and won't be able to because of this.

I think we should just proceed with eventgate, I'll do staging in each first. Will have to delete in staging and redeploy.

Plan for staging:

helmfile -e staging destroy
# wait and make sure all is gone.
helmfile -e staging apply
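
The "wait and make sure all is gone" step could be scripted roughly as below. This is only a sketch: `wait_until_empty` is a hypothetical helper, not part of helmfile or kubectl, and the namespace and label in the usage comment are assumptions that would need to match the actual chart.

```shell
#!/bin/sh
# Poll a listing command until it succeeds and prints nothing,
# or give up after a fixed number of tries.
# Usage: wait_until_empty <tries> <delay_seconds> <command> [args...]
wait_until_empty() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Only an empty listing from a successful command counts as "gone";
        # a failing command is retried rather than trusted.
        if out=$("$@" 2>/dev/null) && [ -z "$out" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Hypothetical usage on the deploy host (namespace and label are assumptions):
#   helmfile -e staging destroy
#   wait_until_empty 30 5 kubectl -n eventgate-logging-external get pods -l app=eventgate -o name \
#     && helmfile -e staging apply
```

The helper returns non-zero on timeout, so the `apply` in the usage comment would only run once the old pods no longer show up.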

For eqiad and codfw (from https://wikitech.wikimedia.org/wiki/DNS/Discovery#How_to_manage_a_DNS_Discovery_service)

First lower DNS ttl for the eventgate deployment:

puppetmaster1001$ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external' set/ttl=10

Deploy in codfw:

puppetmaster1001$ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external,name=codfw' set/pooled=false
# make sure codfw is depooled
puppetmaster1001$ sudo confctl --quiet --object-type discovery select 'dnsdisc=eventgate-logging-external' get

# delete and re-deploy in codfw
deploy1002$ helmfile -e codfw destroy
deploy1002$ helmfile -e codfw apply
# wait for the k8s service to look healthy, then repool

puppetmaster1001$ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external,name=codfw' set/pooled=true
# make sure codfw is pooled again
puppetmaster1001$ confctl --quiet --object-type discovery select 'dnsdisc=eventgate-logging-external' get
# make sure both eqiad and codfw are pooled

Repeat this process for eqiad.

Reset DNS ttl for the eventgate deployment:

puppetmaster1001$ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external' set/ttl=300

Repeat all of this for each eventgate deployment.
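
Since the same sequence has to be repeated per deployment and per DC, it may help to print the full command list for review before running anything. The sketch below only echoes the commands from the plan above and executes nothing; `print_dc_cycle` and `print_plan` are hypothetical helper names, and the waits for the DNS TTL and for the k8s service to look healthy are still manual.

```shell
#!/bin/sh
# Print the depool / destroy / apply / repool commands for one service in one DC.
# confctl lines are meant for the puppetmaster host, helmfile lines for the deploy host.
print_dc_cycle() {
    svc=$1   # e.g. eventgate-logging-external
    dc=$2    # codfw or eqiad
    echo "sudo confctl --object-type discovery select 'dnsdisc=${svc},name=${dc}' set/pooled=false"
    echo "sudo confctl --quiet --object-type discovery select 'dnsdisc=${svc}' get"
    echo "helmfile -e ${dc} destroy"
    echo "helmfile -e ${dc} apply"
    echo "sudo confctl --object-type discovery select 'dnsdisc=${svc},name=${dc}' set/pooled=true"
    echo "sudo confctl --quiet --object-type discovery select 'dnsdisc=${svc}' get"
}

# Print the whole plan: lower the TTL, cycle codfw then eqiad, reset the TTL.
print_plan() {
    plan_svc=$1
    echo "sudo confctl --object-type discovery select 'dnsdisc=${plan_svc}' set/ttl=10"
    for plan_dc in codfw eqiad; do
        print_dc_cycle "$plan_svc" "$plan_dc"
    done
    echo "sudo confctl --object-type discovery select 'dnsdisc=${plan_svc}' set/ttl=300"
}

# Example: print_plan eventgate-logging-external
```

Printing rather than executing keeps a human in the loop between the depool and the destroy, which is where the waiting happens.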

Don't forget to wait for the DNS TTL and/or lower the TTL before every depool/repool operation.

so you might want to first do

puppetmaster1001$ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external' set/ttl=10

And then re-set it to 300 once you're done with your work.

Thanks, added this step into my comment above.

Ah, there were some mistakes in our patches: the TLS Service wasn't using label selectors that matched the pods' labels. Reverting for now.
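
For context, a Service only routes to pods whose labels contain every key=value pair in its selector. A rough way to sanity-check that before deploying is sketched below; `selector_matches` is a hypothetical helper, and the label names in the example are made up for illustration, not taken from the chart.

```shell
#!/bin/sh
# Return 0 if every key=value pair in the selector also appears in the labels.
# Both arguments are comma-separated lists such as "app=eventgate,release=main".
selector_matches() {
    selector=$1
    labels=$2
    old_ifs=$IFS
    IFS=,
    for pair in $selector; do
        # Wrap both sides in commas so only whole key=value pairs match.
        case ",$labels," in
            *",$pair,"*) ;;
            *) IFS=$old_ifs; return 1 ;;
        esac
    done
    IFS=$old_ifs
    return 0
}

# Example with made-up labels:
#   selector_matches "app=eventgate,release=main" "app=eventgate,release=main,chart=eventgate"
```

On a live cluster the two inputs could be read from `kubectl get svc <name> -o jsonpath='{.spec.selector}'` and `kubectl get pods --show-labels`, then compared by hand or reshaped into this comma-separated form.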

Let's discuss T291848: Clarify common k8s label and service conventions in our helm charts a little more before proceeding.

Mentioned in SAL (#wikimedia-operations) [2021-09-27T16:10:45Z] <ottomata> reverting eventgate-logging-external chart change in codfw - T291504

@Ottomata Should we move this task from Analytics to Data Engineering?