
Request for custom nginx configuration, Wikidata primary sources tool VPS project
Closed, Resolved · Public

Description

The project's Web services need to keep connections alive using the nginx upstream and keepalive directives: http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive
I tested the following nginx configuration on a third-party server, which seems to work just fine.

  • File: /etc/nginx/sites-enabled/pst
  • sample content:
upstream ka {
    server localhost:9999; ###localhost must be replaced with the instance IP
    keepalive 32;
}

server {

    listen ###public IP here###;
    
    location /pst/curate {
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP  $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_pass_header Set-Cookie;
        proxy_pass http://ka/pst/curate;
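        ### HTTP/1.1 and an empty Connection header are required for keepalive to the upstream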
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}
  1. I've been digging for specific documentation on how to achieve this on VPS projects, with no success;
  2. I also blindly tried to use the puppet class profile::dumps::web::nginx, with no success:
    • the class makes /etc/nginx available;
    • added /etc/nginx/sites-enabled/pst with the above configuration;
    • replaced localhost with 10.68.22.221 (i.e., the instance IP);
    • sudo nginx -s reload.

Any guidance on this is super appreciated, and I'll gladly create a documentation page on wikitech afterwards.

Event Timeline

@Hjfocs, the domain proxy for Cloud VPS instances is a shared nginx server which uses a custom lookup function to map requests from the client to the appropriate backend VPS instance and port based on the inbound request's Host header. This setup does not use upstream blocks to describe the backends. Functionally this means that the solution you are describing is not possible when using the domain proxy.

Can you provide more information about the problem you are hoping to solve with HTTP keepalive between the proxy and your backend server?

Thanks for your reply, @bd808.

Premise: the back end is implemented in Java.

TL;DR: The back-end services communicate with the storage engine through HTTP.

Example workflow:

  1. the client fires a POST to https://pst.wmflabs.org/pst/curate;
  2. the back-end service processes the client request;
  3. the service fires an internal HTTP request (i.e., via http://10.68.22.221:9999) to the storage engine, in order to update some data;
  4. the storage engine responds to the service;
  5. the service sends the final response to the client.

The problem: I'm unpredictably getting org.apache.http.NoHttpResponseException: 10.68.22.221:9999 failed to respond. Step 4 fails.

This seems to happen when HTTP connections get closed between steps 1 and 3.
Sending a Connection: Keep-Alive header in step 3 doesn't fix the problem, while upstream + keepalive works.
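
For reference, a minimal sketch of how the internal request in step 3 might look with the Apache HttpClient, with the Connection: Keep-Alive header set explicitly (the endpoint path and payload are illustrative, not the actual tool code):

import org.apache.http.HttpHeaders;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class StorageEngineUpdate {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // Step 3: internal request to the storage engine; the path is a placeholder
            HttpPost post = new HttpPost("http://10.68.22.221:9999/bigdata/sparql");
            post.setHeader(HttpHeaders.CONNECTION, "Keep-Alive");
            post.setEntity(new StringEntity("update=...", ContentType.APPLICATION_FORM_URLENCODED));
            try (CloseableHttpResponse response = client.execute(post)) {
                // A persistent connection closed by the peer surfaces here as NoHttpResponseException
                System.out.println(EntityUtils.toString(response.getEntity()));
            }
        }
    }
}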

Example workflow:

  1. the client fires a POST to https://pst.wmflabs.org/pst/curate;
  2. the back-end service processes the client request;
  3. the service fires an internal HTTP request (i.e., via http://10.68.22.221:9999) to the storage engine, in order to update some data;
  4. the storage engine responds to the service;
  5. the service sends the final response to the client.

Looking at the domain proxy setup, this translates to:

  1. the client fires a POST to https://pst.wmflabs.org/pst/curate;
    1. client connects to domain proxy nginx instance via public IP (208.80.155.156)
    2. Request includes Host: pst.wmflabs.org
    3. nginx on domain proxy server (at this moment, novaproxy-01.project-proxy.eqiad.wmflabs) reads HTTP payload to determine backend
    4. nginx on domain proxy server constructs a proxied HTTP POST request to http://10.68.22.221:9999
      1. This IP and port are selected because "pst.wmflabs.org" has been registered via Horizon to use this backing server and port.
    5. 10.68.22.221 is routed to the pst.wikidata-primary-sources-tool.eqiad.wmflabs VM
  2. the back-end service processes the client request;
    1. Blazegraph is running on port 9999 on pst.wikidata-primary-sources-tool.eqiad.wmflabs
    2. Blazegraph receives the POST
  3. the service fires an internal HTTP request (i.e., via http://10.68.22.221:9999) to the storage engine, in order to update some data;
    1. My knowledge of the stack stops here, but it sounds like Blazegraph talks to itself via HTTP?
  4. the storage engine responds to the service;
    1. Again, this is outside my knowledge of the deep stack activities
  5. the service sends the final response to the client.
    1. Blazegraph on pst.wikidata-primary-sources-tool.eqiad.wmflabs responds to the request from nginx on novaproxy-01.project-proxy.eqiad.wmflabs
    2. Nginx on novaproxy-01.project-proxy.eqiad.wmflabs responds to the client with the HTTP payload it received

The problem: I'm unpredictably getting org.apache.http.NoHttpResponseException: 10.68.22.221:9999 failed to respond. Step 4 fails.

This seems to happen when HTTP connections get closed between steps 1 and 3.
Sending a Connection: Keep-Alive header in step 3 doesn't fix the problem, while upstream + keepalive works.

HTTP Keep-Alive would be a client connection pooling optimization for Blazegraph talking to Blazegraph. There is no nginx proxy intervening in this conversation unless Blazegraph is actually talking to itself via the proxied hostname (pst.wmflabs.org) rather than direct communication using the 10.68.22.221:9999 ip and port. If you are using the public hostname instead of the direct ip or internal hostname, this seems potentially problematic just from the point of view of additional and unnecessary latency and routing complication. The only way that I see an nginx instance would be involved is if you are using the public hostname of the proxy instead of directly talking to the internal service via an ip or internal hostname.
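
For concreteness, that client-side pooling on the Java side would look roughly like the sketch below, using Apache HttpClient's PoolingHttpClientConnectionManager (the numbers are illustrative, not taken from the actual code):

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledClient {
    public static CloseableHttpClient build() {
        PoolingHttpClientConnectionManager pool = new PoolingHttpClientConnectionManager();
        pool.setMaxTotal(32);            // total pooled connections across all routes
        pool.setDefaultMaxPerRoute(32);  // connections kept alive per backend host:port
        pool.setValidateAfterInactivity(1000); // re-check idle pooled connections before reuse
        return HttpClients.custom()
                .setConnectionManager(pool)
                .build();
    }
}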

Is there a part of this stack that I am misunderstanding, or is your initial inbound request actually to some other end point?

  1. My knowledge of the stack stops here, but it sounds like Blazegraph talks to itself via HTTP?

Exactly.

HTTP Keep-Alive would be a client connection pooling optimization for Blazegraph talking to Blazegraph.

I agree it should be. I tried with the Connection header and will investigate further.

There is no nginx proxy intervening in this conversation unless Blazegraph is actually talking to itself via the proxied hostname (pst.wmflabs.org) rather than direct communication using the 10.68.22.221:9999 ip and port.

True, it indeed uses direct communication.

The only way that I see an nginx instance would be involved is if you are using the public hostname of the proxy instead of directly talking to the internal service via an ip or internal hostname.

I totally agree. For some reason I don't yet understand, the nginx upstream directive solved the problem. That's why I opened this ticket.

Is there a part of this stack that I am misunderstanding

No, I think you've got the full picture.

is your initial inbound request actually to some other end point?

No.

Anyway, I really appreciate your analysis.
I will get back to you once I have finished my investigation.

Hjfocs claimed this task.

Premise: the primary sources tool back end is being implemented as a Wikidata query service module.

After some digging, I found that the Wikidata query service uses an old version (4.4) of the Apache HttpClient library, which made the primary sources tool hit this bug:
https://issues.apache.org/jira/browse/HTTPCLIENT-1609
The nginx upstream and keepalive directives are a workaround for the bug.

Version 4.4.1 fixed the bug:
https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.4.x.txt

Upgrading to 4.4.1 seems to have resolved the issue.
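
For anyone hitting the same symptom, a quick way to confirm which HttpClient release actually ends up on the query service classpath is to read the library's own version metadata; a sketch:

import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.VersionInfo;

public class HttpClientVersionCheck {
    public static void main(String[] args) {
        // Reads the version.properties resource shipped inside the httpclient jar
        VersionInfo info = VersionInfo.loadVersionInfo(
                "org.apache.http.client", HttpClientBuilder.class.getClassLoader());
        System.out.println(info != null ? info.getRelease() : "unknown");
        // 4.4.1 or later no longer hits HTTPCLIENT-1609
    }
}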