The docker-report timers have been failing for a while due to timeouts while retrieving the catalog (/v2/_catalog) from the Docker Registry.
The workflow is the following:
- /v2/_catalog is called multiple times, since it returns paginated results (max 50 repositories per page at the moment). Each HTTP response contains a Link header that specifies the next paginated call to make. The goal is to get all the repositories, to then inspect their tags.
- After the catalog is fully retrieved, we apply our filters to decide what images to inspect and run/send reports on.
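For reference, the pagination loop described above can be sketched roughly as follows. This is not the actual docker-report code, just a minimal illustration of following the Link header until it is no longer returned. The names `list_repositories`, `http_fetch` and `next_url`, the 300s client timeout and the example registry URL are all assumptions made for the sketch.

```python
import json
import re
import urllib.request
from urllib.parse import urljoin

# Matches the target of a Link header with rel="next", e.g.
#   </v2/_catalog?last=foo&n=50>; rel="next"
NEXT_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="?next"?')


def next_url(base_url, link_header):
    """Return the absolute URL of the next page, or None if there is none."""
    if not link_header:
        return None
    match = NEXT_LINK_RE.search(link_header)
    return urljoin(base_url, match.group(1)) if match else None


def http_fetch(url, timeout=300):
    """Fetch one catalog page; returns (decoded JSON body, Link header or None).

    The timeout value is an assumption, chosen to match the Nginx proxy timeout.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp), resp.headers.get("Link")


def list_repositories(registry_url, page_size=50, fetch=http_fetch):
    """Walk /v2/_catalog page by page and return every repository name."""
    url = urljoin(registry_url, f"/v2/_catalog?n={page_size}")
    repos = []
    while url:
        body, link = fetch(url)
        repos.extend(body.get("repositories", []))
        # The last page carries no Link header, which ends the loop.
        url = next_url(registry_url, link)
    return repos
```

Since every page depends on the Link header of the previous one, the calls are strictly sequential: a single slow page delays the whole catalog retrieval, and one timeout makes the whole run fail.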
Docker report currently uses the CDN endpoint of the Docker Registry, so the following timeouts apply:
ATS TTFB: 180s
Nginx (on registryXXXX nodes): proxy TTFB 300s
From my tests, retrieving the paginated catalog takes on average 1 minute per request/response, except for the occasions when Nginx returns a 504 (after 5 minutes have elapsed). It takes a few calls to hit the slowness, but eventually I can clearly see something like this in the logs:
http.request.method=GET http.request.remoteaddr=127.0.0.1 http.request.uri="/v2/_catalog?last=wikimedia%2Fmediawiki-services-example-node-api&n=50" http.request.useragent=curl/7.88.1 http.response.contenttype="application/json; charset=utf-8" http.response.duration=6m0.803912984s http.response.status=200 http.response.written=1793
Note the http.response.duration of 6 mins :(
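To spot these slow requests without reading the logs by eye, something like the following could be used. It is only a sketch: it assumes the log lines are space-separated key=value pairs with optional double quotes (as in the line above) and that the duration uses the Go-style format seen there. The helper names `parse_fields` and `duration_seconds` are made up for the example.

```python
import re
import shlex

# One "<number><unit>" component of a Go-style duration such as 6m0.803912984s.
# Longer unit names come first so that "ms" is not read as "m".
DURATION_PART_RE = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|h|m|s)")
UNIT_SECONDS = {
    "h": 3600.0, "m": 60.0, "s": 1.0,
    "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9,
}


def parse_fields(line):
    """Split a key=value log line into a dict, honouring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def duration_seconds(text):
    """Convert a duration like '6m0.803912984s' into seconds."""
    return sum(float(n) * UNIT_SECONDS[u] for n, u in DURATION_PART_RE.findall(text))
```

Comparing `duration_seconds(fields["http.response.duration"])` against 180 (the ATS TTFB) or 300 (the Nginx proxy timeout) would then show which catalog requests were slow enough to be cut off upstream.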
The slowness isn't limited to specific calls: once you hit the bottleneck, all subsequent HTTP calls take minutes to complete. Then something happens, and the slowness disappears.
I still have no idea where/how the catalog is stored/fetched, but my guess is that it is the culprit of this whole problem.