Details
Related Objects
Event Timeline
Change 722654 had a related patch set uploaded (by Ppchelko; author: Ppchelko):
[operations/deployment-charts@master] Eventgate: Symlink _helpers and _tls_helpers
Change 722935 had a related patch set uploaded (by Ppchelko; author: Ppchelko):
[operations/deployment-charts@master] Update eventgate helmfile.d for eventgate 0.5 chart
Change 722654 merged by Ottomata:
[operations/deployment-charts@master] Eventgate: Symlink _helpers and _tls_helpers
Change 722935 merged by Ottomata:
[operations/deployment-charts@master] Update eventgate helmfile.d for eventgate 0.5 chart
We tried to deploy this today, but ran into an issue: since the k8s resources have been renamed, Kubernetes treats e.g. the Service as new, yet sees the old Service still bound to the same port, causing a port conflict.
To deploy, we are going to have to depool a DC, delete the existing deployment, apply the new one, then repool.
I'd like to talk with someone about https://phabricator.wikimedia.org/T282148#7373078 before we do, to make sure we don't have to do that kind of failover deployment more than once.
> To deploy, we are going to have to depool a DC, delete the existing deployment, apply the new one, then repool.

Since it's such a hassle, should we do the same thing for the eventstreams chart first?
Oof right. I've already merged the eventgate chart change, and I think to roll back we'd have to revert and then bump the chart version to 0.6.0.
Grr, I guess we should roll back, right?
Why roll back? We can just make the same changes to eventstreams before going through the deployment.
I'm worried that in the meantime someone will need to make an emergency fix/change to eventgate and won't be able to because of this.
I think we should just proceed with eventgate, I'll do staging in each first. Will have to delete in staging and redeploy.
Plan for staging:
helmfile -e staging destroy
# wait and make sure all is gone
helmfile -e staging apply
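The "wait and make sure all is gone" step can be scripted as a polling loop instead of eyeballing it. A minimal sketch with a generic retry helper; `wait_until` and the example check command are assumptions, not part of our tooling:

```shell
# Poll a check command until it succeeds, up to a given number of tries.
# (Hypothetical helper, not part of helmfile or our deploy tooling.)
wait_until() {
  local tries=$1; shift
  local i
  for i in $(seq 1 "$tries"); do
    "$@" && return 0
    sleep 1
  done
  return 1
}

# Example (assumed namespace): succeed once no pods remain after the destroy.
# wait_until 60 sh -c '! kubectl -n eventgate-analytics get pods 2>/dev/null | grep -q .'
```

Only once the check passes would you run `helmfile -e staging apply`.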
For eqiad and codfw (from https://wikitech.wikimedia.org/wiki/DNS/Discovery#How_to_manage_a_DNS_Discovery_service)
First, lower the DNS TTL for the eventgate deployment:
puppetmaster1001 $ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external' set/ttl=10
Deploy in codfw:
puppetmaster1001$ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external,name=codfw' set/pooled=false
# make sure codfw is depooled
puppetmaster1001$ sudo confctl --quiet --object-type discovery select 'dnsdisc=eventgate-logging-external' get
# delete and re-deploy in codfw
deploy1002$ helmfile -e codfw destroy
deploy1002$ helmfile -e codfw apply
# wait for the k8s service to look good, then repool
puppetmaster1001$ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external,name=codfw' set/pooled=true
# make sure codfw is repooled, i.e. both eqiad and codfw are pooled
puppetmaster1001$ sudo confctl --quiet --object-type discovery select 'dnsdisc=eventgate-logging-external' get
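The `get` checks above can also be verified mechanically instead of by eye. A minimal sketch, assuming confctl prints a JSON object per datacenter; the sample line below is a hypothetical stand-in, not captured output:

```shell
# Hypothetical sample of `confctl ... get` output; the real format may differ.
sample='{"codfw": {"pooled": false, "references": [], "ttl": 10}}'

# Extract the pooled flag with grep/sed only (no jq dependency).
echo "$sample" | grep -o '"pooled": *[a-z]*' | sed 's/.*: *//'
```

A wrapper could then refuse to proceed with the destroy unless this prints `false` for the DC being redeployed.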
Repeat this process for eqiad.
Reset the DNS TTL for the eventgate deployment:
puppetmaster1001 $ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external' set/ttl=300
Repeat all of this for each eventgate deployment.
Don't forget to wait out the DNS TTL and/or lower the TTL before every depool/repool operation.
So you might want to first run:
puppetmaster1001 $ sudo confctl --object-type discovery select 'dnsdisc=eventgate-logging-external' set/ttl=10
And then reset it to 300 once you're done with your work.
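The wait itself can be scripted: the second field of a dig-style answer line is the remaining TTL in seconds, so you can sleep that long before repooling. A minimal sketch; the record below is an illustrative stand-in, not real output:

```shell
# Illustrative dig-style answer line (hypothetical values).
answer='eventgate-logging-external.discovery.wmnet. 10 IN A 10.2.1.1'

# Field 2 of a dig answer line is the remaining TTL in seconds.
ttl=$(echo "$answer" | awk '{print $2}')
echo "$ttl"

# In a real run you would then wait it out with a small margin:
# sleep "$((ttl + 5))"
```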
Mentioned in SAL (#wikimedia-operations) [2021-09-27T13:58:39Z] <ottomata> beginning re-deploy of eventgate-logging-external - https://phabricator.wikimedia.org/T291504#7380252
Ah, there were some mistakes in our patches: the TLS Service wasn't using the same label selectors as the pods. Reverting for now.
Let's discuss T291848: Clarify common k8s label and service conventions in our helm charts a little more before proceeding.
Mentioned in SAL (#wikimedia-operations) [2021-09-27T16:10:45Z] <ottomata> reverting eventgate-logging-external chart change in codfw - T291504