ATS Backends: Test live cache_text traffic
Closed, Resolved · Public

Description

Begin testing a small fraction of live cache_text traffic through ATS backends.

Identify and fix any functional issues observed.

In order not to be blocked by the lack of TLS support at the appserver layer (see T210411), we can begin experimenting in eqiad, which avoids cross-DC traffic.

This is a 2019-20_Q1 Traffic goal.

Event Timeline

ema created this task. · Jul 22 2019, 9:37 AM
Restricted Application removed a project: Patch-For-Review. · Jul 22 2019, 9:37 AM
ema triaged this task as Normal priority. · Jul 22 2019, 9:37 AM
ema moved this task from Triage to Caching on the Traffic board.

Change 531896 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: reimage cp1075 as text_ats

https://gerrit.wikimedia.org/r/531896

Mentioned in SAL (#wikimedia-operations) [2019-08-27T08:18:39Z] <ema> depool cp1075 and reimage as text_ats T228629

Change 531896 merged by Ema:
[operations/puppet@production] cache: reimage cp1075 as text_ats

https://gerrit.wikimedia.org/r/531896

Script wmf-auto-reimage was launched by ema on cumin1001.eqiad.wmnet for hosts:

['cp1075.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201908270828_ema_81110.log.

Change 532561 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: convert cp1075 to text_ats (hiera/conftool)

https://gerrit.wikimedia.org/r/532561

Change 532561 merged by Ema:
[operations/puppet@production] cache: convert cp1075 to text_ats (hiera/conftool)

https://gerrit.wikimedia.org/r/532561

Completed auto-reimage of hosts:

['cp1075.eqiad.wmnet']

and were ALL successful.

Change 532644 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache: ATS storage configuration for cp1075

https://gerrit.wikimedia.org/r/532644

Change 532644 merged by Ema:
[operations/puppet@production] cache: ATS storage configuration for cp1075

https://gerrit.wikimedia.org/r/532644

Mentioned in SAL (#wikimedia-operations) [2019-08-27T12:15:19Z] <ema> pool cp1075 w/ ATS backend T228629

Change 532700 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] cache_text eqiad: read ats-be etcd keys

https://gerrit.wikimedia.org/r/532700

Change 532700 merged by Ema:
[operations/puppet@production] cache_text eqiad: read ats-be etcd keys

https://gerrit.wikimedia.org/r/532700

Mentioned in SAL (#wikimedia-operations) [2019-08-27T12:36:04Z] <ema> pool cp1075 w/ ATS backend (for real) T228629

Change 532953 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: temporarily use plain HTTP to access docker-registry

https://gerrit.wikimedia.org/r/532953

Change 532953 merged by Ema:
[operations/puppet@production] ATS: temporarily use plain HTTP to access docker-registry

https://gerrit.wikimedia.org/r/532953

Mentioned in SAL (#wikimedia-operations) [2019-08-28T13:59:13Z] <ema> cp1075 ats-be repooled to resume testing T228629

Change 533041 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: temporarily use plain HTTP to access docker-registry"

https://gerrit.wikimedia.org/r/533041

Change 533041 merged by Ema:
[operations/puppet@production] Revert "ATS: temporarily use plain HTTP to access docker-registry"

https://gerrit.wikimedia.org/r/533041

Change 533242 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] ATS: perform MW and RB mangling after cache lookup

https://gerrit.wikimedia.org/r/533242

Change 533242 merged by Ema:
[operations/puppet@production] ATS: perform MW and RB mangling after cache lookup

https://gerrit.wikimedia.org/r/533242

Mentioned in SAL (#wikimedia-operations) [2019-08-30T13:19:27Z] <ema> cp1075: pause ats-be testing during the weekend T228629

Mentioned in SAL (#wikimedia-operations) [2019-09-03T08:49:07Z] <ema> cp1075: pool ats-be with caching enabled T228629

ema closed this task as Resolved. · Mon, Sep 9, 7:47 AM

cp1075 has been serving live production traffic for several days now, so we can consider the test successful.