Final steps for fully-Kubernetes Thumbor
Closed, Resolved (Public)

Description

We have done the work required to get Thumbor running on Kubernetes, and it is now serving some production traffic. However, a few steps remain before we can finalise the project:

  • Migrate Thumbor's memcached backend away from Thumbor bare metal servers T318695
  • Scale up capacity in eqiad

We have currently scaled up enough to handle 100% of codfw traffic on k8s, and in the short term we should move to actually serving 100% of codfw traffic that way. In eqiad we are more limited: as of the last test, we could handle a 60/40 k8s/metal split there. We should clarify and validate those numbers, then scale accordingly if capacity is available. If it is not, we should escalate clearly and highlight the shortfall to managers etc.
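As a rough aid for the validation step, here is a minimal sketch of the scaling arithmetic. It assumes that throughput scales linearly with replica count, which real workloads may not do. The function name and the replica counts are hypothetical placeholders, not figures from this task; only the 60% share comes from the text above.

```python
def replicas_for_share(current_replicas: int, current_pct: int, target_pct: int = 100) -> int:
    """Estimate the replicas needed to serve target_pct of traffic.

    Assumes linear scaling: if current_replicas can serve current_pct of
    traffic, each replica serves current_pct / current_replicas percent.
    Integer arithmetic is used so the result is rounded up exactly.
    """
    if current_replicas <= 0:
        raise ValueError("current_replicas must be positive")
    if not 0 < current_pct <= 100 or not 0 < target_pct <= 100:
        raise ValueError("percentages must be in the range 1..100")
    # Ceiling division without floating point.
    return -(-current_replicas * target_pct // current_pct)


if __name__ == "__main__":
    # Hypothetical example: 30 replicas currently cover a 60% share.
    print(replicas_for_share(30, 60))
```

The estimate is only a starting point for a capacity request; load testing is still needed to confirm the real numbers.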

  • Decommission the existing Thumbor hosts

Once we have the capacity, we should decommission the old hosts as soon as possible. Four of the eight servers are out of warranty.

  • Clean up Puppet classes/defines
  • Remove apt component
  • General performance improvements T333445

Find out whether there are areas we need to improve, either per format or in general. Ideally, bound this work within specific tasks so that it does not drag on and this ticket does not sprawl.

Event Timeline

Change 908501 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: make tmp-dir configurable, default disabled

https://gerrit.wikimedia.org/r/908501

Change 908501 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: make tmp-dir configurable, default disabled

https://gerrit.wikimedia.org/r/908501

Change 908549 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: set maxUnavailable to a higher number

https://gerrit.wikimedia.org/r/908549

Change 908549 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: set maxUnavailable to a higher number

https://gerrit.wikimedia.org/r/908549

Change 914737 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] admin_ng: increase thumbor resource limits, eqiad replicas

https://gerrit.wikimedia.org/r/914737

Change 914737 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng: increase thumbor resource limits, eqiad replicas

https://gerrit.wikimedia.org/r/914737

Change 916506 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: haproxy timeout changes, block /metrics

https://gerrit.wikimedia.org/r/916506

Change 916506 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: haproxy timeout changes, block /metrics

https://gerrit.wikimedia.org/r/916506

Change 919148 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] thumbor: fix typo

https://gerrit.wikimedia.org/r/919148

Change 919148 merged by jenkins-bot:

[operations/deployment-charts@master] thumbor: fix typo

https://gerrit.wikimedia.org/r/919148

Change 919808 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/deployment-charts@master] admin_ng, thumbor: double memory limit for namespace and pods

https://gerrit.wikimedia.org/r/919808

Change 919808 merged by jenkins-bot:

[operations/deployment-charts@master] admin_ng, thumbor: double memory limit for namespace and pods

https://gerrit.wikimedia.org/r/919808

Change 946951 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] thumbor: remove thumbor server configuration

https://gerrit.wikimedia.org/r/946951

Change 946951 merged by Hnowlan:

[operations/puppet@production] thumbor: remove thumbor server configuration

https://gerrit.wikimedia.org/r/946951

Change 951545 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] service: move thumbor from thumbor pool to kubesvc

https://gerrit.wikimedia.org/r/951545

Change 951546 had a related patch set uploaded (by Hnowlan; author: Hnowlan):

[operations/puppet@production] conftool: clean up thumbor pools

https://gerrit.wikimedia.org/r/951546

hnowlan updated the task description.

Change 951545 merged by Hnowlan:

[operations/puppet@production] service: move thumbor from thumbor pool to kubesvc

https://gerrit.wikimedia.org/r/951545

Mentioned in SAL (#wikimedia-operations) [2024-02-07T12:31:25Z] <hnowlan@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488)

Mentioned in SAL (#wikimedia-operations) [2024-02-07T12:32:34Z] <hnowlan@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488)

Mentioned in SAL (#wikimedia-operations) [2024-02-07T12:34:56Z] <hnowlan@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488)

Mentioned in SAL (#wikimedia-operations) [2024-02-07T12:35:53Z] <hnowlan@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488)

Change 951546 merged by Hnowlan:

[operations/puppet@production] conftool: clean up thumbor pools

https://gerrit.wikimedia.org/r/951546

Mentioned in SAL (#wikimedia-operations) [2024-02-08T10:38:26Z] <hnowlan@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488)

Mentioned in SAL (#wikimedia-operations) [2024-02-08T10:39:31Z] <hnowlan@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1020*,lvs2014*} and A:lvs (T334488)

Mentioned in SAL (#wikimedia-operations) [2024-02-08T10:40:03Z] <hnowlan@cumin2002> START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488)

Mentioned in SAL (#wikimedia-operations) [2024-02-08T10:41:11Z] <hnowlan@cumin2002> END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on P{lvs1019*,lvs2013*} and A:lvs (T334488)