Timeout while retrieving the catalog from the Docker Registry
Closed, Resolved · Public

Description

The docker-report timers have been failing for a while due to timeouts while retrieving the catalog (/v2/_catalog) from the Docker Registry.

The workflow is the following:

  1. /v2/_catalog is called multiple times, since it returns paginated results (currently capped at 50 entries per page). The HTTP response contains a Link header that specifies the next paginated call to make. The goal is to retrieve all the repositories, so that their tags can then be inspected (a sketch of this loop follows the list).
  2. Once the catalog is fully retrieved, we apply our filters to decide which images to inspect and run/send reports on.
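
As a reference, this is a minimal sketch of that pagination loop (not the actual docker-report code; the endpoint and page size are placeholders):

```
# Minimal sketch: walk the paginated /v2/_catalog endpoint by following the
# Link header until it disappears, collecting every repository name.
import requests

REGISTRY = "https://docker-registry.wikimedia.org"  # placeholder endpoint

def list_repositories(page_size=50, timeout=300):
    repos = []
    url = f"{REGISTRY}/v2/_catalog?n={page_size}"
    while url:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        repos.extend(resp.json().get("repositories", []))
        # requests exposes the parsed Link header via resp.links; the registry
        # sets rel="next" with a ?last=<repo>&n=<page_size> query string.
        next_link = resp.links.get("next", {}).get("url")
        url = requests.compat.urljoin(url, next_link) if next_link else None
    return repos

if __name__ == "__main__":
    print(f"{len(list_repositories())} repositories found")
```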

Docker-report currently uses the CDN endpoint of the docker registry, so the following timeouts apply:

  • ATS TTFB: 180s
  • Nginx (on registryXXXX nodes): proxy TTFB 300s

From my tests it seems that retrieving the paginated catalog takes on average 1 minute per request/response, except for the occasional 504 returned by Nginx once its 5-minute proxy timeout elapses. It takes a few calls to hit the slowness, but eventually I can clearly see something like this in the logs:

http.request.method=GET http.request.remoteaddr=127.0.0.1 http.request.uri="/v2/_catalog?last=wikimedia%2Fmediawiki-services-example-node-api&n=50" http.request.useragent=curl/7.88.1 http.response.contenttype="application/json; charset=utf-8" http.response.duration=6m0.803912984s http.response.status=200 http.response.written=1793

Note the http.response.duration of 6 mins :(

The slowness isn't limited to specific calls: once you hit the bottleneck, all subsequent HTTP calls take minutes to complete. Then, at some point, something changes and the slowness disappears.
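
A throwaway probe along these lines (endpoint is a placeholder; the thresholds are the ATS/Nginx TTFB limits listed above) is enough to log the wall-clock duration of each paginated request and spot the slow pages:

```
# Illustrative timing probe, not the production tooling: fetch each catalog
# page and print how long it took, flagging anything above the 180s CDN limit.
import time
import requests

REGISTRY = "https://docker-registry.wikimedia.org"  # placeholder endpoint

url = f"{REGISTRY}/v2/_catalog?n=50"
page = 0
while url:
    start = time.monotonic()
    resp = requests.get(url, timeout=310)  # just above the 300s Nginx proxy TTFB
    elapsed = time.monotonic() - start
    flag = "  <-- would time out behind the 180s ATS TTFB" if elapsed > 180 else ""
    print(f"page {page}: HTTP {resp.status_code} in {elapsed:.1f}s{flag}")
    next_link = resp.links.get("next", {}).get("url")
    url = requests.compat.urljoin(url, next_link) if next_link else None
    page += 1
```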

I still have no idea where/how the catalog is stored and fetched, but I suspect it is the culprit of this whole problem.

Event Timeline

I tried disabling the Redis blobdescriptor cache (switching it to inmemory) on the eqiad registry nodes, and I didn't hit the timeout problem while fetching the catalog. I'm not entirely sure what's happening, but the bottleneck may be the Redis cache and/or its state.

Current setup:

  • registry100* hosts using the inmemory blobdescriptor cache
  • registry200* hosts using the redis blobdescriptor cache (a quick way to check which variant a host runs is sketched below)
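
The cache variant is selected by the storage.cache.blobdescriptor key in the registry's configuration. A hypothetical quick check to print what a given host is running with (the config path is an assumption, not necessarily our deployed one):

```
# Hypothetical helper: print which blobdescriptor cache this registry host uses.
# The config path is an assumption (the docker-registry package default).
import yaml  # PyYAML

with open("/etc/docker/registry/config.yml") as f:
    cfg = yaml.safe_load(f)

cache = cfg.get("storage", {}).get("cache", {}).get("blobdescriptor", "<unset>")
print(f"blobdescriptor cache: {cache}")  # expected: 'inmemory' or 'redis'
```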

The docker-report tool no longer times out, and I see much better timings for each paginated request: requests take 2-4x longer when redis is involved (for example 26s vs 1.25 minutes).

I've also reverted registry1004 (eqiad, depooled/not active) from inmemory back to redis caching to make sure the difference wasn't traffic-related, and indeed I got the same higher latency as in codfw.

It is interesting that the replies are identical, i.e. they contain the same JSON response (so the repositories are apparently not fetched from Redis, but from Swift).

Change #1077933 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::docker_registry_ha::registry: add nginx monitoring

https://gerrit.wikimedia.org/r/1077933

Revamped https://grafana-rw.wikimedia.org/d/StcefURWz/docker-registry a bit, and I noticed that the Swift P50 latency for Stat operations goes from ~100ms to 1.5s when the catalog is fetched.
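
To see whether that spike is also visible when talking to Swift directly (i.e. bypassing the registry), a probe like the one below could be run while a catalog fetch is in progress; the auth URL, credentials and container name are placeholders, not our actual setup:

```
# Hypothetical probe: time a Stat-style (HEAD) call and a prefixed listing
# against Swift directly, bypassing the registry.
import time
from swiftclient.client import Connection

conn = Connection(authurl="https://swift.example.org/auth/v1.0",  # placeholder
                  user="docker:registry", key="SECRET")           # placeholders

start = time.monotonic()
headers = conn.head_container("docker_registry")  # HEAD == Stat operation
print(f"HEAD container: {time.monotonic() - start:.2f}s, "
      f"{headers.get('x-container-object-count')} objects")

start = time.monotonic()
_, objects = conn.get_container("docker_registry", limit=100,
                                prefix="files/docker/registry/v2/repositories/",
                                delimiter="/")
print(f"listing 100 objects under the repositories prefix: "
      f"{time.monotonic() - start:.2f}s")
```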

Going to run the following tests:

  1. Run docker-report using the registry's eqiad svc endpoint to see if the same problem is reproducible. I'd say yes, but better to be safe since I am testing on the depooled DC (eqiad).
  2. If the bottleneck is confirmed, disable redis caching on both registry100X hosts and redo the test.

Change #1078380 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] swift: avoid rate-limit for the Docker account

https://gerrit.wikimedia.org/r/1078380

Next steps:

Change #1078380 merged by Elukey:

[operations/puppet@production] swift: avoid rate-limit for the Docker account

https://gerrit.wikimedia.org/r/1078380

Rolled out the swift proxy change; it seems to have solved the issue. Doing more tests before closing, to confirm.

elukey claimed this task.

The issue seems solved: I tested docker-report multiple times and didn't see any problems (even going through the CDN, which has the lower TTFB timeout). Tentatively closing :)

Change #1077933 abandoned by Elukey:

[operations/puppet@production] role::docker_registry_ha::registry: add nginx monitoring

https://gerrit.wikimedia.org/r/1077933