
eventstreams cannot be deployed and its deployments will need to be destroyed and recreated
Closed, Resolved (Public)

Authored on Nov 30 2022, 8:17 AM
Referenced Files
F35825355: image.png
Nov 30 2022, 12:36 PM
F35825322: image.png
Nov 30 2022, 11:45 AM
F35825309: image.png
Nov 30 2022, 11:30 AM
F35825294: image.png
Nov 30 2022, 11:18 AM


After was merged, the matchLabels of the Deployment object for eventstreams depend on the chart id, which includes the chart version.

Given that in Kubernetes the matchLabels of a Deployment are immutable (for pretty obvious reasons...), we now need to:

  1. remove the chartid from matchLabels
  2. depool eventstreams in a datacenter
  3. destroy and recreate the deployment there
  4. re-pool the dc
  5. rinse and repeat in the other datacenter.
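
To illustrate why a chart version inside matchLabels blocks every upgrade, here is a minimal Python sketch. The label keys, the chart id format (`eventstreams-<version>`) and the version numbers are illustrative assumptions, not copied from the actual chart. The check only mimics the rule that a Deployment's selector cannot change after creation.

```python
def chart_id(name: str, version: str) -> str:
    """Build a chart id that embeds the chart version (assumed format)."""
    return f"{name}-{version}"


def match_labels(version: str, include_chart_id: bool) -> dict[str, str]:
    """Return the selector labels a release would render (hypothetical keys)."""
    labels = {"app": "eventstreams", "release": "production"}
    if include_chart_id:
        labels["chart"] = chart_id("eventstreams", version)
    return labels


def selector_update_allowed(old: dict[str, str], new: dict[str, str]) -> bool:
    """Mimic the immutability rule: an update passes only if the selector is unchanged."""
    return old == new


# With the chart id in the selector, any chart version bump changes the selector.
print(selector_update_allowed(match_labels("0.4.0", True), match_labels("0.4.1", True)))  # False

# Without it, the selector stays stable across chart versions.
print(selector_update_allowed(match_labels("0.4.0", False), match_labels("0.4.1", False)))  # True
```

Once a Deployment has been created with the versioned selector, the only way to move to the stable one is to delete and recreate it, hence the steps listed above.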

Marking as UBN! as eventstreams can't be deployed right now and that seems serious enough to need a fix immediately.


Event Timeline

Joe triaged this task as Unbreak Now! priority.Nov 30 2022, 8:21 AM
Joe added projects: Kubernetes, Data-Engineering, SRE.
Joe updated the task description.
Joe added subscribers: Ottomata, gmodena.

Change 862222 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/deployment-charts@master] Remove chartid from matchlabels for eventstreams

Adding various Data Engineering planning and streaming tags for visibility.
Expediting into the current sprint because this is an Unbreak Now! task.

Change 862222 merged by jenkins-bot:

[operations/deployment-charts@master] Remove chartid from matchlabels for eventstreams

I have received guidance from @Clement_Goubert on the steps required to depool/destroy/deploy/repool eventstreams in each data centre: P41870

In codfw and eqiad we have canary releases though, so I can apply the change with additional steps based on adding --selector name=canary and --selector name=production to the helmfile commands.
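
A rough sketch of how the per-datacenter sequence might be written down as a dry run. The contents of P41870 are not reproduced here, so the ordering, the `helmfile -e <dc> --selector ... destroy` and `apply` invocations, and the `pool` action of the cookbook are assumptions. Only the `depool` cookbook command and the two selectors appear in this task. The function builds command strings and executes nothing.

```python
RELEASES = ("canary", "production")  # selector names mentioned in this task


def plan_for(dc: str, service: str = "eventstreams") -> list[str]:
    """Return the ordered commands for one datacenter (dry run, nothing is executed)."""
    steps = [f"sudo cookbook sre.discovery.service-route depool {dc} {service}"]
    steps.append("# wait until existing clients have drained")
    # Destroy both releases so the Deployments with the old selectors are gone.
    steps += [f"helmfile -e {dc} --selector name={r} destroy" for r in RELEASES]
    # Recreate them from the chart that no longer puts the chart id in matchLabels.
    steps += [f"helmfile -e {dc} --selector name={r} apply" for r in RELEASES]
    # Assumed counterpart of the depool action.
    steps.append(f"sudo cookbook sre.discovery.service-route pool {dc} {service}")
    return steps


for dc in ("codfw", "eqiad"):
    print(f"# {dc}")
    for command in plan_for(dc):
        print(command)
```

Printing the plan first makes it easy to review the order with someone else before touching either datacenter.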

I have merged the linked CR so I will begin this process now.

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T11:11:04Z] <btullis> depooling codfw for eventstreams for T324074

New client connections to eventstreams in codfw have virtually stopped. Existing clients are draining slowly. I will wait until they have all gone.

image.png (462×1 px, 77 KB)

eventstreams in codfw has been completely drained.

image.png (462×1 px, 82 KB)

Proceeding to destroy and redeploy the service.

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T11:32:14Z] <btullis> destroying the eventstreams deployment in codfw and reapplying for T324074

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T11:34:55Z] <btullis> repooling codfw for eventstreams for T324074

eventstreams in codfw is now handling traffic again nicely.

image.png (462×1 px, 81 KB)

Proceeding to depool eqiad.

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T11:59:20Z] <btullis> depooling eqiad for eventstreams for T324074

Depooled eqiad using the cookbook method, which was new to me.

btullis@cumin1001:~$ sudo cookbook sre.discovery.service-route depool eqiad eventstreams
START - Cookbook sre.discovery.service-route
TTL already set to 300, nothing to do
Setting pooled=False for tags: {'dnsdisc': '(eventstreams)', 'name': 'eqiad'}
Waiting 300.00 seconds for DNS changes to propagate
Expected routes:
eventstreams: codfw
Checking that eventstreams.discovery.wmnet records for eqiad matches eventstreams.svc.codfw.wmnet (
Checking that eventstreams.discovery.wmnet records for codfw matches eventstreams.svc.codfw.wmnet (
END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)

Now awaiting full draining of eventstreams in eqiad.
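
The wait for draining was done by watching the dashboard. As a sketch only, the same wait could be expressed as a polling loop. The `get_connections` callable is a hypothetical stand-in for whatever metric query reports the number of connected clients, and the poll interval and timeout are arbitrary defaults.

```python
import time
from typing import Callable


def wait_until_drained(
    get_connections: Callable[[], int],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the connection count reaches zero; return False on timeout."""
    waited = 0.0
    while True:
        if get_connections() == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


# Example with a fake metric source standing in for the real dashboard query.
samples = iter([120, 40, 3, 0])
print(wait_until_drained(lambda: next(samples), sleep=lambda _: None))  # True
```

Passing `sleep` as a parameter keeps the example instant to run and makes the timeout path easy to exercise.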

Mentioned in SAL (#wikimedia-analytics) [2022-11-30T12:29:53Z] <btullis> repooling eqiad for eventstreams for T324074

All looks good with eventstreams in eqiad again.

image.png (462×1 px, 71 KB)

Thank you both!

A pleasure.