
Analyze surge of traffic in AQS that lead to 504s
Closed, Resolved · Public · 13 Estimated Story Points

Description

On 2018-03-20 from 11:07 to 11:10 UTC, Icinga started complaining about too many 5xx responses returned for the domain wikimedia.org. It turned out to be related to the v1/pageviews API.

AQS metrics containing the spike:

https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1521543209228&to=1521544844913

From an early analysis made by Joseph of Webrequest data (so from the Varnish viewpoint), it seems that multiple IPs using a python-requests UA made 192125 HTTP 200 requests and 188177 HTTP 404 requests at minute 11:09 (~380k reqs/min, ~6338/s).
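As a quick sanity check of those numbers (simple arithmetic, nothing more):

# Requests seen in Webrequest data at minute 11:09
ok_200 = 192125         # HTTP 200
not_found_404 = 188177  # HTTP 404
per_minute = ok_200 + not_found_404
print(per_minute)         # 380302 -> ~380k reqs/min
print(per_minute / 60.0)  # ~6338 reqs/s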

From the metrics it seems that Restbase at some point started returning 504s (gateway timeout), failing with the following error:

https://logstash.wikimedia.org/goto/c8f93229a85d7d961df699459347cf6e

Error: connect EADDRNOTAVAIL 10.2.2.12:7232 - Local (10.64.0.223:0)
    at Object.exports._errnoException (util.js:1018:11)
    at exports._exceptionWithHostPort (util.js:1041:20)
    at connect (net.js:880:16)
    at net.js:1009:7
    at /srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/dnscache/lib/index.js:80:28
    at /srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/dnscache/lib/cache.js:116:13
    at RawTask.call (/srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/asap/asap.js:40:19)
    at flush (/srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/asap/raw.js:50:29)
    at _combinedTickCallback (internal/process/next_tick.js:73:7)
    at process._tickCallback (internal/process/next_tick.js:104:9)

Since I am not seeing a huge spike in latency metrics for AQS, I am wondering whether all these requests saturated RESTBase's ephemeral ports for a brief amount of time, causing this Node.js error.

On restbase1007:

elukey@restbase1007:~$ cat /proc/sys/net/ipv4/ip_local_port_range
32768	60999

That is ~28k ephemeral ports available (60999 - 32768 + 1 = 28232). From this graph it seems that sockets in TIME-WAIT increased a lot during the timeframe:

https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&from=1521541876143&to=1521547671625&panelId=13&fullscreen&var-server=restbase1008&var-datasource=eqiad%20prometheus%2Fops
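A rough way to double-check the hypothesis on a host is to compare the ephemeral port range with the number of sockets currently sitting in TIME-WAIT; a minimal sketch (assuming a Linux host with the usual /proc layout, not the exact commands that were run):

# Compare the available ephemeral port range with the number of sockets
# currently in TIME-WAIT (state 06 in /proc/net/tcp and /proc/net/tcp6).
import os

def ephemeral_port_count():
    with open('/proc/sys/net/ipv4/ip_local_port_range') as f:
        low, high = map(int, f.read().split())
    return high - low + 1  # 60999 - 32768 + 1 = 28232 on restbase1007

def time_wait_count():
    count = 0
    for path in ('/proc/net/tcp', '/proc/net/tcp6'):
        if not os.path.exists(path):
            continue
        with open(path) as f:
            next(f)  # skip the header line
            for line in f:
                if line.split()[3] == '06':  # 06 == TIME_WAIT
                    count += 1
    return count

print('ephemeral ports available:', ephemeral_port_count())
print('sockets in TIME-WAIT:', time_wait_count())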

From https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux

"If the TIME-WAIT sockets are on the client side, such a situation is easy to detect. The call to connect() will return EADDRNOTAVAIL and the application will log some error message about that."

and

elukey@restbase1007:~$ sudo sysctl net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_reuse = 0

"By enabling net.ipv4.tcp_tw_reuse, Linux will reuse an existing connection in the TIME-WAIT state for a new outgoing connection if the new timestamp is strictly bigger than the most recent timestamp recorded for the previous connection: an outgoing connection in the TIME-WAIT state can be reused after just one second."

We are using this option on cache and Swift hosts IIRC. So if everything that I've written vaguely makes sense, I think that the issue was on the RESTBase side, even if AQS also needs some tweaks to answer 404s more quickly.

Would it make sense to increase on Restbase the ephemeral ports available and/or enable tw_reuse?

Awesome explanations for those settings on caching hosts in https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/cacheproxy/manifests/performance.pp
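For reference, the two knobs being discussed boil down to a couple of /proc/sys entries; a minimal sketch of applying them by hand (values are illustrative only, and in production this would of course go through puppet, like the cacheproxy manifest above):

# Illustrative only: widen the ephemeral port range and enable TIME-WAIT reuse.
# The values below are examples, not the ones actually deployed; pick a port
# range that does not collide with listening service ports. Requires root.
SETTINGS = {
    '/proc/sys/net/ipv4/ip_local_port_range': '22500 65535',
    '/proc/sys/net/ipv4/tcp_tw_reuse': '1',
}

for path, value in SETTINGS.items():
    with open(path, 'w') as f:
        f.write(value + '\n')
    print('set %s = %s' % (path, value))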

Event Timeline

Thus far we can see an increase of 404s that led to 504s.

In order to cache 404s in Varnish for a short period of time we'd need to add a config setting to hyperswitch and use that setting in AQS. RESTBase will just proxy whatever Cache-Control header AQS sets.
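The idea in a nutshell, as a minimal sketch (plain WSGI with a stubbed lookup, purely illustrative and not hyperswitch/AQS code): set a short Cache-Control on 404 responses so Varnish can absorb repeated identical misses for a while instead of passing each one to the backend.

# Illustrative WSGI app: 404 responses get a short Cache-Control so an upstream
# cache (e.g. Varnish) can serve repeated misses without hitting the backend.
NOT_FOUND_TTL = 60  # seconds; keep it short, the data may just be delayed

def lookup(path):
    # stub standing in for the real data lookup
    return None

def app(environ, start_response):
    data = lookup(environ.get('PATH_INFO', '/'))
    if data is None:
        start_response('404 Not Found', [
            ('Content-Type', 'application/json'),
            ('Cache-Control', 's-maxage=%d, max-age=%d' % (NOT_FOUND_TTL, NOT_FOUND_TTL)),
        ])
        return [b'{"type":"not_found"}']
    start_response('200 OK', [('Content-Type', 'application/json')])
    return [data]

if __name__ == '__main__':
    from wsgiref.simple_server import make_server
    make_server('localhost', 8000, app).serve_forever()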

mobrovac renamed this task from "Analyze surge of traffiq in aqs that lead to 504s" to "Analyze surge of traffic in AQS that lead to 504s". Mar 20 2018, 9:53 PM
mobrovac added a project: RESTBase-API.
mobrovac updated the task description.

I think we need to investigate the effects of 404s in the API. Caching those for a short time will not stop the effects of a storm of 404s, but it will prevent re-running queries we know are going to return 404 when data is, for example, delayed.

elukey changed the visibility from "Public (No Login Required)" to "Custom Policy". Mar 21 2018, 9:12 AM
elukey updated the task description.

Great analysis @elukey!

> Would it make sense to increase on Restbase the ephemeral ports available and/or enable tw_reuse?

Yes, we should definitely do that. I am not sure how these two (especially the latter) escaped us earlier.

TCP tw-reuse has been rolled out to all the restbase hosts via https://gerrit.wikimedia.org/r/#/c/421901. Next week we'll enable connection re-use between rb <-> aqs as a test.

https://github.com/wikimedia/restbase/pull/984 was created to enable TCP connection reuse between rb and aqs. Once it is enabled and we have verified that nothing is on fire, we can close the task.

After a chat with @JAllemandou, we agreed that it would be best also to apply the same config from AQS to Druid, so hyperswitch will try to re-use conns on every "level" (rb -> aqs -> druid).

> After a chat with @JAllemandou, we agreed that it would be best also to apply the same config from AQS to Druid, so hyperswitch will try to re-use conns on every "level" (rb -> aqs -> druid).

+1. Note that for that you will need the latest version of preq.

@JAllemandou whenever you have time, let's discuss this :)

mforns set the point value for this task to 13. May 7 2018, 3:40 PM
mforns moved this task from In Progress to Paused on the Analytics-Kanban board.

After a chat with the Services team: preq needs to contain https://github.com/wikimedia/preq/pull/31, so as far as I can see that means 0.5.4+ (we are currently running 0.5.2).

After some patches we ended up with:

Merged and deployed in AQS labs. I configured the environment to use a "fake" localhost Druid cluster, and created a simple Python HTTP webserver to log POST queries. This is what I captured via tcpdump for the Swagger edit tests:

POST /druid/v2/ HTTP/1.1
content-type: application/json
x-request-id: 983418b4-6ee4-11e8-a969-9f1bc9f00305
user-agent: HyperSwitch
x-client-ip: ::ffff:127.0.0.1
host: localhost:8082
content-length: 595
Connection: keep-alive

I did the same test on aqs1004, this time setting tcpdump to capture all traffic to port 8082:

POST /druid/v2/ HTTP/1.1
content-type: application/json
x-request-id: 4c54dc04-585e-11e8-bfc7-b7b8a8526ccf
user-agent: HyperSwitch
x-client-ip: ::ffff:10.64.0.107
host: druid-public-broker.svc.eqiad.wmnet:8082
content-length: 595
Connection: keep-alive

So keep-alive seems to be working already! Is it a default setting for HyperSwitch?
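For reference, a minimal sketch of the kind of localhost POST-logging webserver mentioned above (a hypothetical reconstruction, not the exact script that was used):

# Hypothetical stand-in for a "fake Druid broker" on localhost:8082: it logs
# each POST (request line, headers, body) so keep-alive behaviour can be
# inspected, and answers with an empty JSON array.
from http.server import BaseHTTPRequestHandler, HTTPServer

class LoggingHandler(BaseHTTPRequestHandler):
    protocol_version = 'HTTP/1.1'  # needed so the server keeps connections open

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        print(self.requestline)
        print(self.headers)  # shows Connection, user-agent, x-request-id, ...
        print(body.decode('utf-8', 'replace'))
        payload = b'[]'
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

if __name__ == '__main__':
    HTTPServer(('localhost', 8082), LoggingHandler).serve_forever()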

elukey triaged this task as Medium priority. Jun 13 2018, 8:49 AM
elukey changed the visibility from "Custom Policy" to "Public (No Login Required)".

After a chat with Services we discovered that the request module sets keep-alive by default for HTTP connections :)
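The same behaviour is easy to reproduce on the client side with python-requests (only an analogy to what the Node request module does; assumes the logging server sketched above is listening on localhost:8082):

# Client-side illustration of connection reuse: a requests.Session pools and
# reuses TCP connections, so the three POSTs below should travel over a single
# connection (verifiable with tcpdump on lo, port 8082).
import requests

with requests.Session() as s:
    for i in range(3):
        r = s.post('http://localhost:8082/druid/v2/', json={'query': i})
        print(r.status_code, r.json())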

Mentioned in SAL (#wikimedia-operations) [2018-06-13T12:57:28Z] <elukey@deploy1001> Started deploy [analytics/aqs/deploy@84fab89]: Update AQS for T190213

Mentioned in SAL (#wikimedia-operations) [2018-06-13T13:00:06Z] <elukey@deploy1001> Finished deploy [analytics/aqs/deploy@84fab89]: Update AQS for T190213 (duration: 02m 38s)

Change 440120 had a related patch set uploaded (by Elukey; owner: Elukey):
[analytics/aqs/deploy@master] Remove old/redundant git_server config

https://gerrit.wikimedia.org/r/440120

Change 440120 merged by Elukey:
[analytics/aqs/deploy@master] Remove old/redundant git_server config

https://gerrit.wikimedia.org/r/440120