
eventstreams cannot be deployed and its deployments will need to be destroyed and recreated
Closed, Resolved, Public

Authored By
Joe
Nov 30 2022, 8:17 AM

Description

After https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/831957 was merged, the matchLabels of the Deployment object for eventstreams depend on the chart id, which includes the chart version.
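To make the consequence concrete, here is a minimal sketch. It assumes the chart id has the form `<chart name>-<chart version>`; that format, the label keys, and the version numbers are hypothetical and not taken from the eventstreams chart. Kubernetes rejects any update that changes a Deployment's `spec.selector`, so a version bump that alters the rendered selector makes the upgrade fail.

```shell
#!/bin/sh
# Hypothetical illustration: the label keys, the chart id format and the
# version numbers are assumptions, not values from the eventstreams chart.

# Render the selector labels a chart would emit for a given version.
render_match_labels() {
    chart_name=$1
    chart_version=$2
    printf 'app=%s,chart=%s-%s\n' "$chart_name" "$chart_name" "$chart_version"
}

# Report whether upgrading between two versions would change the selector.
selector_changes() {
    old=$(render_match_labels "$1" "$2")
    new=$(render_match_labels "$1" "$3")
    if [ "$old" = "$new" ]; then
        echo "unchanged"
    else
        echo "changed"
    fi
}

# A version bump changes the selector, which the API server would refuse.
selector_changes eventstreams 0.1.0 0.1.1
```

Dropping the version-dependent label from the selector makes the rendered value stable across upgrades, which is what step 1 of the plan aims for.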

Given that the matchLabels of a Deployment are immutable in Kubernetes (for pretty obvious reasons...), we now need to:

  1. remove the chartid from matchLabels
  2. depool eventstreams in a datacenter
  3. destroy and recreate the deployment there
  4. re-pool the dc
  5. rinse and repeat in the other datacenter.
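The ordering of steps 2 to 5 can be sketched as a dry run that only prints the plan. The `depool` invocation matches the cookbook command shown later in this task; the `pool` action and the use of the cookbook for both datacenters are assumptions, and the drain and redeploy steps are left as placeholders.

```shell
#!/bin/sh
# Dry-run sketch: prints the planned steps instead of executing them.
# Assumptions: the cookbook "pool" action, and using the cookbook for
# both datacenters. Drain and redeploy steps are placeholders only.

SERVICE=eventstreams

# Print the steps for a single datacenter.
plan_dc() {
    dc=$1
    echo "sudo cookbook sre.discovery.service-route depool $dc $SERVICE"
    echo "# wait until existing $SERVICE clients in $dc have drained"
    echo "# destroy and recreate the $SERVICE deployment in $dc"
    echo "sudo cookbook sre.discovery.service-route pool $dc $SERVICE"
}

# Handle one datacenter at a time so the other keeps serving traffic.
plan_all() {
    for dc in "$@"; do
        plan_dc "$dc"
    done
}

plan_all codfw eqiad
```

Each datacenter is fully repooled before the next one is depooled, so at least one site serves traffic throughout.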

Marking as UBN! as eventstreams can't be deployed right now and that seems serious enough to need a fix immediately.

Event Timeline

Joe triaged this task as Unbreak Now! priority. Nov 30 2022, 8:21 AM
Joe added projects: Kubernetes, Data-Engineering, SRE.
Joe updated the task description.
Joe added subscribers: Ottomata, gmodena.

Change 862222 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove chartid from matchlabels for eventstreams

https://gerrit.wikimedia.org/r/862222

Adding various Data Engineering planning and streaming tags for visibility.
Expediting into the current sprint because this is an Unbreak Now! task.

Change 862222 merged by jenkins-bot:

[operations/deployment-charts@master] Remove chartid from matchlabels for eventstreams

https://gerrit.wikimedia.org/r/862222

I have received guidance from @Clement_Goubert on the steps required to depool/destroy/deploy/repool eventstreams in each data centre: P41870

In codfw and eqiad we have canary releases though, so I can apply the change with additional steps based on adding --selector name=canary and --selector name=production to the helmfile commands.
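As a rough sketch of what the selector-based steps could look like, the function below prints one destroy and one apply command per release. The `--selector name=canary` and `--selector name=production` arguments come from the comment above; the `-e <dc>` environment flag and the order of operations are assumptions, and the actual procedure in P41870 may differ.

```shell
#!/bin/sh
# Dry-run sketch: prints helmfile commands rather than running them.
# Assumptions: the environment flag (-e <dc>) and the order in which the
# canary and production releases are destroyed and reapplied.

redeploy_plan() {
    dc=$1
    for release in canary production; do
        echo "helmfile -e $dc --selector name=$release destroy"
        echo "helmfile -e $dc --selector name=$release apply"
    done
}

redeploy_plan codfw
```

Printing the commands first gives a chance to review them before running anything against a depooled datacenter.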

I have merged the linked CR so I will begin this process now.

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T11:11:04Z] <btullis> depooling codfw for eventstreams for T324074

New eventstreams client connections in codfw have virtually stopped. Existing clients are draining slowly, so I will wait until they have all gone.

image.png (462×1 px, 77 KB)

eventstreams in codfw has been completely drained.

image.png (462×1 px, 82 KB)

Proceeding to destroy and redeploy the service.

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T11:32:14Z] <btullis> destroying the eventstreams deployment in codfw and reapplying for T324074

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T11:34:55Z] <btullis> repooling codfw for eventstreams for T324074

eventstreams in codfw is now handling traffic again nicely.

image.png (462×1 px, 81 KB)

Proceeding to depool eqiad.

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T11:59:20Z] <btullis> depooling eqiad for eventstreams for T324074

Depooled eqiad using the cookbook method, which was new to me.

btullis@cumin1001:~$ sudo cookbook sre.discovery.service-route depool eqiad eventstreams
START - Cookbook sre.discovery.service-route
TTL already set to 300, nothing to do
Setting pooled=False for tags: {'dnsdisc': '(eventstreams)', 'name': 'eqiad'}
Waiting 300.00 seconds for DNS changes to propagate
Expected routes:
eventstreams: codfw
Checking that eventstreams.discovery.wmnet records for eqiad matches eventstreams.svc.codfw.wmnet (10.2.1.34)
Checking that eventstreams.discovery.wmnet records for codfw matches eventstreams.svc.codfw.wmnet (10.2.1.34)
END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)

Now awaiting full draining of eventstreams in eqiad.

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T12:29:53Z] <btullis> repooling eqiad for eventstreams for T324074

All looks good with eventstreams in eqiad again.

image.png (462×1 px, 71 KB)

Thank you both!

A pleasure.