
Make main api s3 client reuse connection and do tls resumption
Closed, Resolved | Public | 1 Estimated Story Points

Description

Screenshot 2025-08-26 at 12.34.39.png (834×3 px, 500 KB)

The recent on-demand-api performance tests revealed high CPU time spent in net/http TLS handshakes. While the tests were running, a single API instance was found to be establishing more than 50k connections to S3, which confirms the problem.
This happens because the existing s3.go library uses the default AWS SDK v1 config, and the SDK falls back to a default HTTP client with default configuration values.

func New(env *env.Environment) s3iface.S3API {
	// No HTTPClient is set here, so the SDK uses its default HTTP client.
	cfg := &aws.Config{
		Region: aws.String(env.AWSRegion),
	}

The default transport:

  • Does keep-alive, but
  • Has no TLS session resumption (ClientSessionCache is nil), and
  • Uses very conservative idle connection settings.

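Detailed settings of the default HTTP client:

Idle connections:
MaxIdleConns: 100
MaxIdleConnsPerHost: 2 (the field is unset, so http.DefaultMaxIdleConnsPerHost applies)
* At most 2 idle keep-alive connections are kept per S3 endpoint.
   With more than 2 concurrent requests, the surplus connections are closed instead of being pooled, so later requests keep paying for new TLS handshakes.

Idle timeout:
IdleConnTimeout: 90 seconds
* Idle connections are closed after 90s.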

TLS:
TLSHandshakeTimeout: 10s
TLSClientConfig: nil (so no ClientSessionCache)

In the on-demand API, NewGetLargeEntities fans out multiple parallel GetObjectWithContext calls. Some requests reuse connections, but many still perform a full TLS handshake on a new connection.

Hence the recommendation is to use a custom HTTP client with the settings below. The session cache created by tls.NewLRUClientSessionCache enables TLS session resumption.

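func New(env *env.Environment) s3iface.S3API {
    cfg := &aws.Config{
        Region: aws.String(env.AWSRegion),
        // Custom client so connections and TLS sessions are reused across S3 calls.
        HTTPClient: &http.Client{
            Transport: &http.Transport{
                TLSClientConfig: &tls.Config{
                    MinVersion:         tls.VersionTLS12,
                    // Caches up to 128 sessions to allow TLS resumption.
                    ClientSessionCache: tls.NewLRUClientSessionCache(128),
                },
                MaxIdleConns:        100,
                // Raised from the default of 2 so parallel requests keep their connections.
                MaxIdleConnsPerHost: 100,
                IdleConnTimeout:     90 * time.Second,
            },
        },
    }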

This can be validated by running the current perf tests and inspecting the continuous profiler graph for TLS handshake activity.

Event Timeline

RThomas-WMF updated the task description.
RThomas-WMF renamed this task from Make main api s3 client reuse connection and tls resumption to Make main api s3 client reuse connection and do tls resumption. Aug 26 2025, 11:57 AM
HShaikh set the point value for this task to 3. Sep 3 2025, 1:14 PM
RThomas-WMF changed the task status from Open to In Progress. Sep 8 2025, 10:16 AM
RThomas-WMF claimed this task.
RThomas-WMF changed the point value for this task from 3 to 1.
RThomas-WMF moved this task from Next Up to In Progress on the Wikimedia Enterprise (Sprint 81) board.