On 2018-03-20, from 11:07 to 11:10 UTC, Icinga started alerting about too many 5xx responses returned for the domain wikimedia.org. It turned out to be related to the v1/pageviews API.
AQS metrics containing the spike:
https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1521543209228&to=1521544844913
From an early analysis made by Joseph of Webrequest data (i.e., from the Varnish viewpoint), it seems that multiple IPs using a python-requests user agent made 192125 requests returning 200 and 188177 returning 404 during minute 11:09 (~380k reqs/min, ~6338/s).
From the metrics it seems that Restbase at some point started returning 504s (gateway timeout), failing with the following error:
https://logstash.wikimedia.org/goto/c8f93229a85d7d961df699459347cf6e
```
Error: connect EADDRNOTAVAIL 10.2.2.12:7232 - Local (10.64.0.223:0)
at Object.exports._errnoException (util.js:1018:11)
at exports._exceptionWithHostPort (util.js:1041:20)
at connect (net.js:880:16)
at net.js:1009:7
at /srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/dnscache/lib/index.js:80:28
at /srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/dnscache/lib/cache.js:116:13
at RawTask.call (/srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/asap/asap.js:40:19)
at flush (/srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/asap/raw.js:50:29)
at _combinedTickCallback (internal/process/next_tick.js:73:7)
at process._tickCallback (internal/process/next_tick.js:104:9)
```
Since I am not seeing a huge spike in the AQS latency metrics, I am wondering whether all these requests briefly saturated Restbase's ephemeral ports, causing this Node.js error (EADDRNOTAVAIL with a local port of 0 means the kernel could not find a free local port for the outgoing connection).
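A back-of-the-envelope check makes this hypothesis plausible, assuming the Linux default ephemeral port range and TIME_WAIT duration (neither was measured on the Restbase hosts), and assuming each request opened a fresh connection to the AQS backend rather than reusing a pooled one:

```python
# Rough plausibility check for ephemeral port exhaustion.
# Assumptions (Linux defaults, not measured on the hosts involved):
EPHEMERAL_PORTS = 60999 - 32768 + 1   # default net.ipv4.ip_local_port_range
TIME_WAIT_SECS = 60                   # default time a closed socket holds its port

# Peak rate observed in webrequest data (200s + 404s in minute 11:09).
req_per_sec = (192125 + 188177) / 60  # ~6338/s

# If every request opened a new connection to 10.2.2.12:7232, sockets stuck
# in TIME_WAIT would accumulate far beyond the available ephemeral range:
ports_needed = req_per_sec * TIME_WAIT_SECS
print(f"{req_per_sec:.0f} req/s -> {ports_needed:.0f} ports held in TIME_WAIT "
      f"vs {EPHEMERAL_PORTS} available")
```

Connection reuse (keep-alive pooling) between Restbase and AQS would change this picture entirely, so this only shows the hypothesis is numerically possible, not that it happened.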
The alternative is that the AQS backend was overwhelmed by requests, processing them slowly while new ones piled up, causing the 504s.