On 2018-03-20, from 11:07 to 11:10 UTC, Icinga started alerting about too many 5xx responses returned for the domain wikimedia.org. It turned out to be related to the v1/pageviews API.
AQS metrics containing the spike:
https://grafana.wikimedia.org/dashboard/db/aqs-elukey?orgId=1&from=1521543209228&to=1521544844913
From an early analysis made by Joseph of Webrequest data (i.e., from the Varnish viewpoint), it seems that multiple IPs using a python-requests user agent made 192125 requests returning 200 and 188177 returning 404 during minute 11:09 (~380k reqs/min, ~6338/s).
From the metrics it seems that Restbase at some point started returning 504s (gateway timeout), failing with the following error:
https://logstash.wikimedia.org/goto/c8f93229a85d7d961df699459347cf6e
```
Error: connect EADDRNOTAVAIL 10.2.2.12:7232 - Local (10.64.0.223:0)
at Object.exports._errnoException (util.js:1018:11)
at exports._exceptionWithHostPort (util.js:1041:20)
at connect (net.js:880:16)
at net.js:1009:7
at /srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/dnscache/lib/index.js:80:28
at /srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/dnscache/lib/cache.js:116:13
at RawTask.call (/srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/asap/asap.js:40:19)
at flush (/srv/deployment/restbase/deploy-cache/revs/8dbc93c6b3747dbfe90f8f8e56fd6af661cf5e69/node_modules/asap/raw.js:50:29)
at _combinedTickCallback (internal/process/next_tick.js:73:7)
at process._tickCallback (internal/process/next_tick.js:104:9)
```
Since I am not seeing a huge spike in the AQS latency metrics, I am wondering whether all these requests briefly saturated Restbase's ephemeral ports, causing this Node.js error (EADDRNOTAVAIL with a local port of 0 means the kernel could not find a free local port for the outgoing connection).
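A back-of-the-envelope check makes this hypothesis plausible, assuming the Linux default ephemeral port range and TIME_WAIT duration (neither was measured on the Restbase hosts), and assuming each request opened a fresh connection to the AQS backend rather than reusing a pooled one:

```python
# Rough plausibility check for ephemeral port exhaustion.
# Assumptions (Linux defaults, not measured on the hosts involved):
EPHEMERAL_PORTS = 60999 - 32768 + 1   # default net.ipv4.ip_local_port_range
TIME_WAIT_SECS = 60                   # default time a closed socket holds its port

# Peak rate observed in webrequest data (200s + 404s in minute 11:09).
req_per_sec = (192125 + 188177) / 60  # ~6338/s

# If every request opened a new connection to 10.2.2.12:7232, sockets stuck
# in TIME_WAIT would accumulate far beyond the available ephemeral range:
ports_needed = req_per_sec * TIME_WAIT_SECS
print(f"{req_per_sec:.0f} req/s -> {ports_needed:.0f} ports held in TIME_WAIT "
      f"vs {EPHEMERAL_PORTS} available")
```

Connection reuse (keep-alive pooling) between Restbase and AQS would change this picture entirely, so this only shows the hypothesis is numerically possible, not that it happened.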
The alternative is that the AQS backend was overwhelmed by requests, processing them slowly while new ones piled up, causing the 504s.