Timeout while retrieving the catalog from the Docker Registry
Closed, Resolved · Public

Description

The docker-report timers have been failing for a while due to timeouts while retrieving the catalog (/v2/_catalog) from the Docker Registry.

The workflow is the following:

  1. /v2/_catalog is called multiple times, since it returns paginated results (currently capped at 50 entries per page). The HTTP response contains a Link header that specifies the next paginated call to make. The goal is to retrieve all the repositories, so that their tags can then be inspected (a sketch of this loop follows the list).
  2. Once the catalog is fully retrieved, we apply our filters to decide which images to inspect and run/send reports on.
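
As a reference, this is a minimal sketch of that pagination loop (not the actual docker-report code; the endpoint and page size are placeholders):

```
# Minimal sketch: walk the paginated /v2/_catalog endpoint by following the
# Link header until it disappears, collecting every repository name.
import requests

REGISTRY = "https://docker-registry.wikimedia.org"  # placeholder endpoint

def list_repositories(page_size=50, timeout=300):
    repos = []
    url = f"{REGISTRY}/v2/_catalog?n={page_size}"
    while url:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
        repos.extend(resp.json().get("repositories", []))
        # requests exposes the parsed Link header via resp.links; the registry
        # sets rel="next" with a ?last=<repo>&n=<page_size> query string.
        next_link = resp.links.get("next", {}).get("url")
        url = requests.compat.urljoin(url, next_link) if next_link else None
    return repos

if __name__ == "__main__":
    print(f"{len(list_repositories())} repositories found")
```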

Docker-report currently uses the CDN endpoint of the docker registry, so the following timeouts apply:

  • ATS TTFB: 180s
  • Nginx (on registryXXXX nodes): proxy TTFB 300s

From my tests it seems that retrieving the paginated catalog takes on average 1 minute per request/response, except for the occasional 504 returned by Nginx once its 5-minute proxy timeout elapses. It takes a few calls to hit the slowness, but eventually I can clearly see something like this in the logs:

http.request.method=GET http.request.remoteaddr=127.0.0.1 http.request.uri="/v2/_catalog?last=wikimedia%2Fmediawiki-services-example-node-api&n=50" http.request.useragent=curl/7.88.1 http.response.contenttype="application/json; charset=utf-8" http.response.duration=6m0.803912984s http.response.status=200 http.response.written=1793

Note the http.response.duration of 6 mins :(

The slowness isn't limited to specific calls: once you hit the bottleneck, all subsequent HTTP calls take minutes to complete. Then, at some point, something changes and the slowness disappears.
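
A throwaway probe along these lines (endpoint is a placeholder; the thresholds are the ATS/Nginx TTFB limits listed above) is enough to log the wall-clock duration of each paginated request and spot the slow pages:

```
# Illustrative timing probe, not the production tooling: fetch each catalog
# page and print how long it took, flagging anything above the 180s CDN limit.
import time
import requests

REGISTRY = "https://docker-registry.wikimedia.org"  # placeholder endpoint

url = f"{REGISTRY}/v2/_catalog?n=50"
page = 0
while url:
    start = time.monotonic()
    resp = requests.get(url, timeout=310)  # just above the 300s Nginx proxy TTFB
    elapsed = time.monotonic() - start
    flag = "  <-- would time out behind the 180s ATS TTFB" if elapsed > 180 else ""
    print(f"page {page}: HTTP {resp.status_code} in {elapsed:.1f}s{flag}")
    next_link = resp.links.get("next", {}).get("url")
    url = requests.compat.urljoin(url, next_link) if next_link else None
    page += 1
```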

I still have no idea where/how the catalog is stored and fetched, but I suspect it is the culprit of this whole problem.

Event Timeline

I tried disabling the Redis blobdescriptor cache (switching it to inmemory) on the eqiad registry nodes, and I didn't hit the timeout problem while fetching the catalog. I'm not entirely sure what's happening, but the bottleneck may be the Redis cache and/or its state.

Current setup:

  • registry100* hosts using the inmemory blobdescriptor cache
  • registry200* hosts using the redis blobdescriptor cache (a quick way to check which variant a host runs is sketched below)
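
The cache variant is selected by the storage.cache.blobdescriptor key in the registry's configuration. A hypothetical quick check to print what a given host is running with (the config path is an assumption, not necessarily our deployed one):

```
# Hypothetical helper: print which blobdescriptor cache this registry host uses.
# The config path is an assumption (the docker-registry package default).
import yaml  # PyYAML

with open("/etc/docker/registry/config.yml") as f:
    cfg = yaml.safe_load(f)

cache = cfg.get("storage", {}).get("cache", {}).get("blobdescriptor", "<unset>")
print(f"blobdescriptor cache: {cache}")  # expected: 'inmemory' or 'redis'
```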

The docker-report tool no longer times out, and I see much better timings for each paginated request: requests take 2-4x longer when redis is involved (for example 26s vs 1.25 minutes).

I've also reverted registry1004 (eqiad, depooled/not active) from inmemory back to redis caching to make sure the difference wasn't traffic-related, and indeed I got the same higher latency as in codfw.

It is interesting that the replies are identical, i.e. they contain the same JSON response (so the repositories are apparently not fetched from Redis, but from Swift).

Change #1077933 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] role::docker_registry_ha::registry: add nginx monitoring

https://gerrit.wikimedia.org/r/1077933

Revamped https://grafana-rw.wikimedia.org/d/StcefURWz/docker-registry a bit, and I noticed that the Swift P50 latency for Stat operations goes from ~100ms to 1.5s when the catalog is fetched.
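
To see whether that spike is also visible when talking to Swift directly (i.e. bypassing the registry), a probe like the one below could be run while a catalog fetch is in progress; the auth URL, credentials and container name are placeholders, not our actual setup:

```
# Hypothetical probe: time a Stat-style (HEAD) call and a prefixed listing
# against Swift directly, bypassing the registry.
import time
from swiftclient.client import Connection

conn = Connection(authurl="https://swift.example.org/auth/v1.0",  # placeholder
                  user="docker:registry", key="SECRET")           # placeholders

start = time.monotonic()
headers = conn.head_container("docker_registry")  # HEAD == Stat operation
print(f"HEAD container: {time.monotonic() - start:.2f}s, "
      f"{headers.get('x-container-object-count')} objects")

start = time.monotonic()
_, objects = conn.get_container("docker_registry", limit=100,
                                prefix="files/docker/registry/v2/repositories/",
                                delimiter="/")
print(f"listing 100 objects under the repositories prefix: "
      f"{time.monotonic() - start:.2f}s")
```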

Going to run the following tests:

  1. Run docker-report using the registry's eqiad svc endpoint to see if the same problem is reproducible. I'd say yes, but better to be safe since I am testing on the depooled DC (eqiad).
  2. If the bottleneck is confirmed, disable redis caching on both registry100X hosts and redo the test.

Change #1078380 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/puppet@production] swift: avoid rate-limit for the Docker account

https://gerrit.wikimedia.org/r/1078380

Next steps:

Change #1078380 merged by Elukey:

[operations/puppet@production] swift: avoid rate-limit for the Docker account

https://gerrit.wikimedia.org/r/1078380

Rolled out the swift proxy change; it seems to have solved the issue. Doing more tests before closing, to confirm.

elukey claimed this task.

The issue seems solved: I tested docker-report multiple times and didn't see any problems (even going through the CDN, which has the lower TTFB timeout). Tentatively closing :)

Change #1077933 abandoned by Elukey:

[operations/puppet@production] role::docker_registry_ha::registry: add nginx monitoring

https://gerrit.wikimedia.org/r/1077933