Amazon intermittent timeouts 2021-12-10
Closed, Resolved (Public)

Description

Making a task since this is still happening. Some IPNs, but not all, are running into timeouts; donations are being completed.

Starting Friday afternoon US time (I'll hunt down the exact time) we started getting failmail saying:

2021-12-11T20:01:21+00:00 [INFO] Starting processing for request, configuration view: 'amazon', action: 'listener'
2021-12-11T20:01:21+00:00 [INFO] Starting processing of listener request from 72.21.217.44
2021-12-11T20:01:21+00:00 [INFO] No IP whitelist specified. Continuing and not validating remote IP '72.21.217.44'.
2021-12-11T20:01:21+00:00 [INFO] (RawData) Incoming message (raw)
2021-12-11T20:01:52+00:00 [ALERT] Error validating Amazon message. Firewall problem?

Looking at more of the logs, the full error message is:
Dec 11 00:03:55 frpig1001 SmashPig: SPCID-0851076298 | Error validating Amazon message. Firewall problem? | |
Dec 11 00:03:55 frpig1001 SmashPig: SPCID-0851076298 | Message denied by security policy, death is me. | #0 /srv/www/org/wikimedia/listeners/SmashPig/Core/Listeners/RestListener.php(16): SmashPig\PaymentProviders\Amazon\AmazonListener->parseEnvelope(Object(SmashPig\Core\Http\Request))
Dec 11 00:03:55 frpig1001 SmashPig: #1 /srv/www/org/wikimedia/listeners/SmashPig/Core/Http/RequestHandler.php(101): SmashPig\Core\Listeners\RestListener->execute(Object(SmashPig\Core\Http\Request), Object(SmashPig\Core\Http\Response))
Dec 11 00:03:55 frpig1001 SmashPig: #2 /srv/www/org/wikimedia/listeners/SmashPig/PublicHttp/smashpig_http_handler.php(13): SmashPig\Core\Http\RequestHandler::process()
Dec 11 00:03:55 frpig1001 SmashPig: #3 {main} | SmashPig\Core\Listeners\ListenerSecurityException@/srv/www/org/wikimedia/listeners/SmashPig/PaymentProviders/Amazon/AmazonListener.php:64 (Unable to post request, underlying exception of Failed to connect to sns.us-east-1.amazonaws.com port 443: Connection timed out Debug logging: * Expire in 0 ms for 6 (transfer 0x5578f82bd250)
Dec 11 00:03:55 frpig1001 SmashPig: * Added mws.amazonservices.com:443:34.233.223.241 to DNS cache
Dec 11 00:03:55 frpig1001 SmashPig: * Expire in 1 ms for 1 (transfer 0x5578f82bd250)
(the above line repeated until)
Dec 11 00:03:55 frpig1001 SmashPig: * Trying 209.54.182.11...
Dec 11 00:03:55 frpig1001 SmashPig: * TCP_NODELAY set
Dec 11 00:03:55 frpig1001 SmashPig: * Expire in 200 ms for 4 (transfer 0x5578f82bd250)
Dec 11 00:03:55 frpig1001 SmashPig: * connect to 209.54.182.11 port 443 failed: Connection timed out
Dec 11 00:03:55 frpig1001 SmashPig: * Failed to connect to sns.us-east-1.amazonaws.com port 443: Connection timed out
Dec 11 00:03:55 frpig1001 SmashPig: * Closing connection 0
Dec 11 00:03:55 frpig1001 SmashPig: )
Dec 11 00:03:55 frpig1001 SmashPig: SPCID-0851076298 | Finished processing listener request | |

The IPN messages that hit this error are resent and then processed correctly.

Event Timeline

We need to update the firewall to allow outbound access from frpig1001 to all the IPs that sns.us-east-1 can resolve to. We have a tool to calculate that: https://phabricator.wikimedia.org/diffusion/WFTO/browse/master/amazon-allowlist/amazon-ranges.sh

Here's the current output:

3.3.16.0/21
3.240.0.0/13
13.34.4.64/26
13.34.8.64/26
13.34.12.192/27
13.34.17.64/26
13.34.29.64/26
13.34.29.128/25
13.34.30.0/24
13.34.31.0/25
13.34.31.192/26
13.34.34.192/26
13.34.40.0/26
13.34.42.0/25
13.34.43.192/26
13.34.47.128/26
13.34.50.128/25
13.34.51.0/26
13.34.55.192/26
13.34.56.0/25
13.34.57.64/26
13.34.57.128/25
13.34.59.64/26
13.34.59.128/26
13.34.60.0/24
13.34.61.0/26
13.34.63.128/26
15.221.0.0/24
15.221.4.0/23
15.221.24.0/21
15.230.14.12/32
15.230.14.18/31
15.230.14.252/31
15.230.18.0/24
15.230.35.0/24
15.230.38.0/24
15.230.40.0/24
15.230.56.0/23
15.230.65.128/25
15.230.66.0/24
15.230.83.0/24
15.230.87.0/24
15.230.130.0/24
15.230.137.0/24
15.230.142.0/24
15.230.145.0/24
15.230.148.0/24
15.230.157.0/24
15.230.162.0/24
15.230.184.0/24
15.230.192.0/23
15.230.201.0/24
15.230.204.0/30
15.230.205.0/24
15.251.0.10/31
43.224.76.4/30
43.224.76.8/29
43.224.76.16/29
43.224.76.72/29
43.224.76.80/30
43.224.76.104/29
43.224.76.112/29
43.224.76.128/28
43.224.76.152/29
43.224.76.160/27
43.224.76.208/28
43.224.76.224/30
43.224.76.232/29
43.224.77.24/29
43.224.77.36/30
43.224.77.40/29
43.224.77.76/30
43.224.77.80/28
43.224.77.96/29
43.224.77.120/29
43.224.77.128/29
43.224.77.192/30
43.224.77.208/29
43.224.79.26/31
43.224.79.28/30
43.224.79.32/30
43.224.79.36/31
43.224.79.42/31
43.224.79.44/31
43.224.79.50/31
43.224.79.52/30
43.224.79.56/30
43.224.79.60/31
43.224.79.70/31
43.224.79.72/29
43.224.79.80/31
43.224.79.106/31
43.224.79.108/30
43.224.79.112/31
43.224.79.120/29
43.224.79.128/30
43.224.79.136/29
43.224.79.144/31
43.224.79.158/31
43.224.79.160/31
43.224.79.166/31
43.224.79.168/31
43.224.79.190/31
43.224.79.192/29
43.224.79.200/31
43.224.79.212/30
43.224.79.216/30
43.224.79.244/30
43.224.79.248/30
52.46.128.0/19
52.46.164.0/22
52.46.168.0/22
52.46.188.24/29
52.46.188.36/30
52.46.188.40/30
52.46.188.60/30
52.46.188.64/30
52.46.188.76/30
52.46.188.80/29
52.46.188.88/30
52.46.188.108/30
52.46.188.120/30
52.46.188.132/30
52.46.188.136/29
52.46.188.144/30
52.46.188.156/30
52.46.188.160/28
52.46.188.176/29
52.46.188.184/30
52.46.188.204/30
52.46.188.208/30
52.46.188.224/27
52.46.189.0/30
52.46.189.12/30
52.46.189.16/30
52.46.189.36/30
52.46.189.40/29
52.46.189.48/28
52.46.189.64/27
52.46.189.96/29
52.46.189.128/28
52.46.189.156/30
52.46.189.160/30
52.46.189.168/29
52.46.189.200/29
52.46.189.240/29
52.46.190.0/28
52.46.190.32/29
52.46.190.56/29
52.46.190.92/30
52.46.190.104/29
52.46.190.206/31
52.46.190.208/31
52.46.190.214/31
52.46.190.216/31
52.46.190.230/31
52.46.190.232/31
52.46.190.242/31
52.46.190.244/31
52.46.190.254/31
52.46.191.0/30
52.46.191.4/31
52.46.191.10/31
52.46.191.12/31
52.46.191.22/31
52.46.191.24/31
52.46.191.46/31
52.46.191.48/31
52.46.191.52/30
52.46.191.60/30
52.46.191.64/30
52.46.191.68/31
52.46.191.82/31
52.46.191.84/30
52.46.191.88/29
52.46.191.96/28
52.46.191.120/30
52.46.191.128/29
52.46.191.136/31
52.46.191.140/30
52.46.191.144/31
52.46.191.148/30
52.46.191.156/30
52.46.191.168/29
52.46.191.176/29
52.46.191.188/30
52.46.191.192/30
52.46.191.200/30
52.46.191.214/31
52.46.191.216/30
52.46.191.220/31
52.46.191.226/31
52.46.191.228/31
52.46.191.238/31
52.46.191.240/31
52.46.250.0/23
52.46.252.0/22
52.93.1.0/24
52.93.3.0/24
52.93.4.0/24
52.93.50.128/26
52.93.50.192/30
52.93.51.28/31
52.93.59.0/24
52.93.60.0/24
52.93.64.0/24
52.93.76.0/24
52.93.87.96/27
52.93.91.96/28
52.93.91.112/30
52.93.97.0/24
52.93.123.98/31
52.93.123.136/32
52.93.123.255/32
52.93.126.122/31
52.93.126.212/30
52.93.127.18/31
52.93.127.68/31
52.93.127.122/31
52.93.127.124/31
52.93.127.162/31
52.93.127.164/30
52.93.127.168/31
52.93.127.172/31
52.93.127.180/30
52.93.127.184/31
52.93.127.200/31
52.93.127.216/30
52.93.127.220/31
52.93.236.0/24
52.93.249.0/24
52.93.254.0/24
52.94.68.0/24
52.94.124.0/22
52.94.132.0/22
52.94.152.9/32
52.94.152.11/32
52.94.152.12/32
52.94.152.44/32
52.94.192.0/22
52.94.224.0/20
52.94.240.0/21
52.94.252.0/22
52.95.41.0/24
52.95.48.0/21
52.95.62.0/23
52.95.108.0/23
52.95.208.0/22
52.95.216.0/22
52.119.196.0/22
52.119.206.0/23
52.119.212.0/22
52.144.192.0/24
52.144.193.0/25
52.144.193.128/26
52.144.194.0/26
52.144.195.0/26
52.144.200.64/26
52.144.200.128/26
54.239.0.0/28
54.239.8.0/21
54.239.16.0/20
54.239.98.0/24
54.239.104.0/23
54.239.108.0/22
54.239.112.0/24
54.240.196.0/24
54.240.202.0/24
54.240.208.0/22
54.240.216.0/22
54.240.228.0/23
54.240.232.0/22
67.220.240.0/20
69.107.3.176/28
69.107.7.32/28
69.107.7.64/28
69.107.7.96/28
72.21.192.0/19
99.78.192.0/22
99.82.176.0/21
99.82.188.0/22
99.83.64.0/21
99.83.88.0/21
99.83.112.0/21
104.255.56.11/32
104.255.56.12/32
150.222.2.0/24
150.222.15.124/30
150.222.66.0/24
150.222.71.0/24
150.222.73.0/24
150.222.76.0/24
150.222.79.0/24
150.222.87.0/24
150.222.99.0/24
150.222.100.0/24
150.222.110.0/24
150.222.136.0/24
150.222.138.0/24
150.222.143.0/24
150.222.205.0/24
150.222.206.0/24
150.222.212.0/24
150.222.218.0/24
150.222.222.0/23
150.222.224.0/24
150.222.226.0/23
150.222.236.0/23
172.96.97.0/24
176.32.96.0/21
176.32.120.0/22
176.32.124.128/25
176.32.125.192/26
199.127.232.0/22
205.251.224.0/22
205.251.240.0/21
205.251.248.0/24
207.171.160.0/19
209.54.176.0/21
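
As a quick sanity check on a list like this, the addresses seen in the logs can be tested against the ranges. The following is only a sketch: it uses Python's standard `ipaddress` module, the helper name `covering_range` is made up for illustration, and only three of the ranges above are copied in. It is not what amazon-ranges.sh does.

```python
import ipaddress

# Three ranges copied from the list above; the full output is much longer.
ALLOWED_RANGES = [
    "52.46.128.0/19",
    "72.21.192.0/19",
    "209.54.176.0/21",
]


def covering_range(address, cidrs):
    """Return the first CIDR block that contains address, or None if none does."""
    ip = ipaddress.ip_address(address)
    for cidr in cidrs:
        if ip in ipaddress.ip_network(cidr):
            return cidr
    return None


if __name__ == "__main__":
    # 209.54.182.11 is the address curl timed out on; 72.21.217.44 sent the IPN.
    for address in ("209.54.182.11", "72.21.217.44"):
        print(address, "->", covering_range(address, ALLOWED_RANGES))
```

Both addresses from the logs fall inside ranges on the list (209.54.176.0/21 and 72.21.192.0/19 respectively), which is consistent with the list being the right thing to allow outbound.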

@Jgreen or @Dwisehaupt could one of you please update that firewall?

I added sns.us-east-1.amazonaws.com to ipset. Please let me know if this doesn't fix the timeouts.

They still seem to be coming in.

I'd like to at least turn off failmail for Amazon, but I'm not sure of the best way to do that. If I set the log-level down to 0 we won't get mail, but stuff won't go to syslog either so we won't know when it's fixed.

Sadly, the cascading config for SmashPig has a limitation for provider-specific overrides: editing amazon/main.yaml lets us change a value defined in provider-defaults, but not remove it.

I could replace the FailmailLogStream with a TaggedFileLogStream that spams files to /tmp, or we could replace the destination email address for Amazon failmail with some bitbucket. Does nobody@wikimedia.org fit the bill, or would sending failmail there cause other problems?
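
To illustrate the limitation, here is a rough sketch of a cascading merge in Python. It is not SmashPig's actual merge code, and the key names (`log-streams`, `syslog`, `failmail`) and the class names `SyslogLogStream` and `NoOpLogStream` are hypothetical placeholders; only `FailmailLogStream` is named above.

```python
def cascade(defaults, overrides):
    """Recursively layer overrides on top of defaults.

    Keys present in overrides replace or extend the defaults; keys that are
    simply left out of overrides are inherited, so nothing can be removed.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = cascade(merged[key], value)
        else:
            merged[key] = value
    return merged


# Hypothetical stand-in for the provider defaults.
PROVIDER_DEFAULTS = {
    "log-streams": {
        "syslog": {"class": "SyslogLogStream"},
        "failmail": {"class": "FailmailLogStream"},
    }
}

# Leaving failmail out of the provider file does not remove it.
omitted = cascade(
    PROVIDER_DEFAULTS,
    {"log-streams": {"syslog": {"class": "SyslogLogStream"}}},
)

# Replacing its value with something harmless does work.
replaced = cascade(
    PROVIDER_DEFAULTS,
    {"log-streams": {"failmail": {"class": "NoOpLogStream"}}},
)

if __name__ == "__main__":
    print(sorted(omitted["log-streams"]))
    print(replaced["log-streams"]["failmail"])
```

Under this kind of merge the only way to silence an inherited entry is to swap its value for one that does nothing, which is why the options above are all replacements rather than removals.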

Change 745999 had a related patch set uploaded (by Ejegg; author: Ejegg):

[wikimedia/fundraising/SmashPig@master] Add no-op log stream

https://gerrit.wikimedia.org/r/745999

Change 745999 merged by jenkins-bot:

[wikimedia/fundraising/SmashPig@master] Add no-op log stream

https://gerrit.wikimedia.org/r/745999

Change 746000 had a related patch set uploaded (by Ejegg; author: Ejegg):

[wikimedia/fundraising/SmashPig@deployment] Add no-op log stream

https://gerrit.wikimedia.org/r/746000

Change 746000 merged by jenkins-bot:

[wikimedia/fundraising/SmashPig@deployment] Add no-op log stream

https://gerrit.wikimedia.org/r/746000

They still seem to be coming in.

Now I see why. There were two issues with ipset: first, I had forgotten to actually add the policy that uses the set; second, they're using a 60s TTL on the DNS record, which means it changes far too frequently to use with ipset. I'll make some adjustments.
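
The TTL problem can be sketched roughly as follows. This is a toy model in Python, not how ipset is populated on frpig1001: the `DnsSnapshot` class and `missing_from_snapshot` helper are invented for illustration, and the addresses are from the 192.0.2.0/24 documentation range rather than real Amazon addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsSnapshot:
    """The addresses a hostname resolved to at one moment."""

    addresses: frozenset
    resolved_at: float  # seconds, on any monotonic clock
    ttl: float = 60.0   # the record's TTL, 60s in this case

    def is_fresh(self, now):
        """True while the DNS answer is still within its TTL."""
        return now - self.resolved_at < self.ttl


def missing_from_snapshot(snapshot, answer):
    """Addresses in a later DNS answer that the snapshot would not allow."""
    return sorted(set(answer) - snapshot.addresses)


# Placeholder addresses; a set built from this answer is only valid for 60s.
snapshot = DnsSnapshot(
    addresses=frozenset({"192.0.2.10", "192.0.2.11"}),
    resolved_at=0.0,
)
later_answer = ["192.0.2.11", "192.0.2.57"]

if __name__ == "__main__":
    print(snapshot.is_fresh(now=30.0))
    print(snapshot.is_fresh(now=61.0))
    print(missing_from_snapshot(snapshot, later_answer))
```

Once the snapshot is older than the TTL, any address that appears only in the newer answer is dropped by the firewall until the set is rebuilt, which would show up as the intermittent connection timeouts described in this task.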