Thumbor leaks pipes
Closed, Resolved · Public

Description

Today thumbor in production filled up its disks with error logs in the form of "Too many open files". This is similar to what we've seen in T156913: Thumbor keeps too many file descriptors open, though this time the leak appears to be pipes only.

I've restarted all instances on thumbor1002 and left thumbor1001 as is for now for inspection. It looks like only two instances on 1001 have hit the limit so far (8836 / 8837), but all instances leak. For example 8827 is approaching its fd limit; I've captured some earlier log output in P4926. The errors seem related either to not finding a suitable engine (recurring, seems to be normal) or to swift-related 401 / Unauthorized errors while PUTting thumbs. I _think_ the unauthorized errors might be related to long-running swift tokens that eventually expire, requiring us to get authorized tokens again.

Restricted Application added a subscriber: Aklapper. · Feb 13 2017, 10:44 AM

Mentioned in SAL (#wikimedia-operations) [2017-02-13T10:47:05Z] <godog> remove big/spammy log files from thumbor100[12] - T157949

So, you think the swift client might be using a token that eventually expires, and leaks a pipe every time it tries, until the thumbor process dies? I guess we can probably compare the count of that error for a given process to its current number of open pipes.
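
For reference, the pipe-count side of that comparison can be pulled straight from /proc; a rough sketch (assuming Linux and permission to inspect the thumbor processes, not something we currently run):

import os
import sys

def count_pipes(pid):
    """Count file descriptors of `pid` that point at anonymous pipes."""
    fd_dir = '/proc/{}/fd'.format(pid)
    pipes = 0
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            # fd was closed between listdir() and readlink()
            continue
        if target.startswith('pipe:'):
            pipes += 1
    return pipes

if __name__ == '__main__':
    for pid in sys.argv[1:]:
        print('{} {}'.format(pid, count_pipes(int(pid))))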

Gilles moved this task from Inbox to Next-up on the Performance-Team board. · Feb 15 2017, 8:50 PM

@Gilles good question, I'm not sure expired swift tokens and pipe leaks are related, but they could very well be. For example, right now I'm not seeing pipes leaking, but all thumbor instances also have a short uptime.

Re: token life, at the moment it is set to a week:

$ git grep token_life modules/swift/
modules/swift/templates/proxy-server.conf.erb:token_life = 604800
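
(604800 seconds works out to exactly one week:)

# 7 days * 24 hours * 3600 seconds
assert 7 * 24 * 3600 == 604800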

Were the swift instances restarted around the time the 401 errors started showing up? I don't think I've tested the scenario of restarting swift after the thumbor instances have established a connection to it.

I did some log spelunking on a sample instance that was showing 401s, in this case thumbor1002 thumbor@8806

startup:
Feb 2 17:41:38 thumbor1002 thumbor@8806[102910]: Reading profile /etc/firejail/thumbor.profile

First PUT error:
Feb 2 18:10:35 thumbor1002 thumbor@8806[102910]: 2017-02-02 18:10:35,276 8806 swiftclient:ERROR Object PUT failed: ht

So I'd say it doesn't seem related to natural swift token expiration, though that particular thumbor instance didn't seem able to recover from this state after the error until it was restarted.

@Gilles no, it doesn't look like the swift proxy was restarted; I checked uptime on all frontends and it is Jan 20th for swift-proxy and 2016 for memcached.

Gilles added a comment (edited). · Feb 17 2017, 12:19 PM

I can't find a single log entry with HTTP/1.0 401 in the swift logs. It seems like those requests are not making it to swift at all. Some PUTs definitely get there from thumbor, but not the failing ones:

gilles@ms-fe1001:/var/log/swift$ cat proxy-access.log | grep PUT | grep python-swiftclient | grep 'HTTP/1.0 201' | wc -l
3220
gilles@ms-fe1001:/var/log/swift$ cat proxy-access.log | grep PUT | grep python-swiftclient | wc -l
3220

An interesting pattern: all the thumbnails that fail to PUT are videos:

gilles@thumbor1002:/srv/log/thumbor$ cat thumbor.error.log | grep '401 Unauthorized' | grep -iv 'webm\|ogv\|ogg' | wc -l
0

Now I know where to start looking.

Gilles added a comment (edited). · Feb 17 2017, 12:28 PM

Also, the 401's wording seems to be Apache's, but maybe Swift uses the same wording; I'm not sure.

Maybe the issue has something to do with sharing the swift connection object across the whole thumbor instance. Going through the video loader is a different code path from the swift loader.

In fact, I'm now thinking of a different issue: the video loader was optimized for accessing video originals through their swift URL without auth, which means it wouldn't work with temp containers. I don't know whether videos can end up in temp containers, though. If they do, that will be an edge case to fix.

Gilles added a comment (edited). · Feb 17 2017, 1:02 PM

Interestingly, a handful of video thumbnails are PUT successfully into Swift:

gilles@ms-fe1001:/var/log/swift$ cat proxy-access.log | grep PUT | grep python-swiftclient | grep -i 'ogv\|webm\|ogg' | wc -l
41

I can't reproduce locally; swift connection reuse across the different loaders works fine on Vagrant.

I can't even reproduce with a thumbor instance started manually on thumbor1002...

Sending a request to a relatively fresh instance: works. Sending a request to a broken running instance: triggers the bug.

Broken instances are in worse shape than I expected. They 404 right away on image requests:

gilles@thumbor1002:~$ wget http://127.0.0.1:8804/wikipedia/commons/thumb/6/6d/Christophe_Henner_-_June_2016.jpg/491px-Christophe_Henner_-_June_2016.jpg
--2017-02-17 13:40:47--  http://127.0.0.1:8804/wikipedia/commons/thumb/6/6d/Christophe_Henner_-_June_2016.jpg/491px-Christophe_Henner_-_June_2016.jpg
Connecting to 127.0.0.1:8804... connected.
HTTP request sent, awaiting response... 404 Not Found
2017-02-17 13:40:47 ERROR 404: Not Found.

gilles@thumbor1002:~$ wget http://127.0.0.1:8899/wikipedia/commons/thumb/6/6d/Christophe_Henner_-_June_2016.jpg/491px-Christophe_Henner_-_June_2016.jpg
--2017-02-17 13:41:03--  http://127.0.0.1:8899/wikipedia/commons/thumb/6/6d/Christophe_Henner_-_June_2016.jpg/491px-Christophe_Henner_-_June_2016.jpg
Connecting to 127.0.0.1:8899... connected.
HTTP request sent, awaiting response... 200 OK
Length: 30629 (30K) [image/jpeg]
Saving to: ‘491px-Christophe_Henner_-_June_2016.jpg’

491px-Christophe_Henner_-_June_2016.jp 100%[===========================================================================>]  29.91K  --.-KB/s   in 0s     

2017-02-17 13:41:04 (334 MB/s) - ‘491px-Christophe_Henner_-_June_2016.jpg’ saved [30629/30629]

They only seem to be able to process video requests.

So far the symptoms are:

  • initially, thumbor instance processes requests fine
  • after some time, the instance breaks but keeps running. It 404s on requests other than videos. It succeeds on videos, but fails to save the resulting thumbnail in swift.

The 404s on requests other than video would suggest that the swift loader breaks as well, i.e. we can't connect to swift properly anymore with auth. But the video loader keeps working because it hits Swift URLs directly via HTTP, without Swift auth.

This is actually a good indication of an auth token issue, since the swift loader uses a different connection than the swift storage does.
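
To illustrate the two code paths (an illustrative sketch with placeholder URL and credentials, not the actual Thumbor loader code): the swift loader goes through the authenticated swiftclient connection, while the video loader fetches the original directly over HTTP, so only the former depends on a valid token:

import requests
from swiftclient.client import Connection

# Placeholder auth URL / credentials; the real values live in Thumbor's config.
conn = Connection(authurl='http://swift.example/auth/v1.0',
                  user='account:user', key='secret', retries=0)

def load_via_swift_loader(container, obj):
    # Uses the shared authenticated connection; raises a ClientException
    # (a 401 under the hood) once the cached token has expired.
    headers, body = conn.get_object(container, obj)
    return body

def load_via_video_loader(public_url):
    # Plain HTTP GET against the object's swift URL; no token involved,
    # which is why this path keeps working after the shared token expires.
    resp = requests.get(public_url)
    resp.raise_for_status()
    return resp.content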

The reason the GET failures are quiet is that we turn off error logging around them, because of the noisy logging coming from the swift client when files are genuinely missing. On PUTs we don't turn off logging, and the errors show up.
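
The suppression looks roughly like this (a sketch of the pattern described, not the actual loader code), which is also why the 401s on GET never surface:

import logging

swift_logger = logging.getLogger('swiftclient')

def quiet_get_object(conn, container, obj):
    # Silence swiftclient's noisy "Object GET failed" logging, which fires
    # even when the object is legitimately absent. The downside: real
    # failures (like an expired token / 401) are silenced as well.
    previous_level = swift_logger.level
    swift_logger.setLevel(logging.CRITICAL)
    try:
        return conn.get_object(container, obj)
    finally:
        swift_logger.setLevel(previous_level)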

The reason why only videos seem to be affected on PUTs is that once the breaking point is past, they're the only type of content that can still be fetched and processed all the way to being PUT.

If this is tied to some kind of expiry, we should be able to estimate it roughly by looking at the age of the instances that do or don't exhibit the broken behavior.

This issue was masked before because we weren't reusing the swift connections, opening a new one for each request (causing the first huge pipe leak). Now that we are recycling them properly, at some point they stop working.

Trying all running instances on thumbor1002, the oldest to still run properly has been up for a little less than 4 hours. The youngest broken instance is a bit less than 15 hours old. There's such a big gap between the two groups that it's difficult to figure out whether there's a threshold. At least the ones that work are consistently young and have been running for long enough that they should have handled the full variety of requests. I'll check on thumbor1001 to see if there's the same pattern there.

I took a quick look at the python swift client, particularly wondering why the exceptions look identical in log messages whether they are genuine 404s or 401s. Like the following, which should contain more information, since it's most likely failing due to auth:

2017-02-17 14:18:13,540 8834 thumbor:ERROR [SWIFT_LOADER] get_object failed: http%3A//ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.6d/6/6d/Christophe_Henner_-_June_2016.jpg ClientException('Object GET failed',)

I see this:

https://github.com/openstack/python-swiftclient/blob/70c90b2243c2df9857a188cbd61d340b7e191d0d/swiftclient/client.py#L1163

https://github.com/openstack/python-swiftclient/blob/714bf0cd4b1fafc47d055ccc12edd4a51ed8cb91/swiftclient/exceptions.py#L39-L45

While the log message is super unhelpful because of the client's override, the underlying exception should contain the data we need to differentiate. And since we're constructing our error log entry ourselves, we might as well add the useful information to it...

The issue I'm having in estimating the expiry duration is that broken thumbor instances survive for a long time, since all they process is video thumbnails and they serve errors for everything else. All I can tell right now is that it seems to expire in less than 15 hours. The luck of the most recent restarts hasn't been enough to catch intermediate values live.

The big-hammer solution would be to de-auth or even disconnect after each GET or PUT request, but maybe this can be finessed a bit more by de-authing only when the ClientException tells us that our token is invalid. I need a way to reproduce this locally without keeping a Thumbor instance running for hours, though. It's still puzzling that @fgiunchedi found a token expiry of a week in the configuration.

If I can't reproduce locally, I'll start by doing more verbose logging of ClientException values to differentiate these errors from genuine 404s at the GET level.
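
ClientException does carry the status code even though its default message hides it, so the extra logging could be as small as this (a sketch, assuming a wrapper around get_object in the swift loader):

import logging
from swiftclient.exceptions import ClientException

logger = logging.getLogger('thumbor')

def get_object_logged(conn, container, obj):
    try:
        return conn.get_object(container, obj)
    except ClientException as e:
        # e.http_status distinguishes an expired token (401) from a
        # genuinely missing object (404), which str(e) alone does not.
        logger.error('[SWIFT_LOADER] get_object failed: %s/%s status=%s reason=%s',
                     container, obj, e.http_status, e.http_reason)
        raise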

Gilles added a comment (edited). · Feb 21 2017, 11:35 AM

I'm able to reproduce locally by setting the token expiry to 10 seconds and inspecting the ClientException's innards: we are indeed getting a 401 and not a 404. Subjectively, it felt like less than 10 seconds before it started failing, more like 5. Something strange must be happening with the way the token expiry works in practice. I would go as far as guessing that we are in practice subject to the default 24-hour value and it might somehow be halved when applied. Not that it really matters, but now that Thumbor survives longer, we have to make it work beyond Swift token expiry no matter what :)

Hmm, I see that the latest version of the swift client has a retry mechanism for 401s: https://github.com/openstack/python-swiftclient/blob/70c90b2243c2df9857a188cbd61d340b7e191d0d/swiftclient/client.py#L1685. I need to check whether the version we're running has the same logic.

Sigh... it's a bug in the 3.0.0 version of the client, which we're using, where auth retries and regular retries are conflated:

https://github.com/openstack/python-swiftclient/blob/3.0.0/swiftclient/client.py#L1581

Since we've set retries to 0, it doesn't retry for auth. More recent versions of the python swift client retry for auth correctly even when retries are set to 0 on the connection object.

I'll try the simplest workaround, which is to set retries to 1. This will needlessly retry genuine 404s, though.
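
Since the 3.0.0 client only re-authenticates as part of its generic retry loop, the workaround amounts to constructing the connection with a single retry; a sketch of what SWIFT_RETRIES = 1 boils down to (placeholder URL and credentials):

from swiftclient.client import Connection

# With retries=1 the 3.0.0 client drops its cached token on a 401,
# re-authenticates and retries once; the cost is one pointless retry
# on genuine 404s.
conn = Connection(authurl='http://swift.example/auth/v1.0',
                  user='account:user', key='secret',
                  retries=1)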

Confirmed: with SWIFT_RETRIES set to 1, everything is fine. Locally I can even see that requests triggering a re-auth take longer.

The bug was actually only fixed recently:
https://github.com/openstack/python-swiftclient/commit/a62b7ee06c1a300c73d9a01b2c9bc05028f1fabd
https://bugs.launchpad.net/gnocchi/+bug/1589926

Change 338955 had a related patch set uploaded (by Gilles):
Lower token life for Swift proxy

https://gerrit.wikimedia.org/r/338955

Change 338955 merged by jenkins-bot:
Lower token life for Swift proxy

https://gerrit.wikimedia.org/r/338955

Change 338957 had a related patch set uploaded (by Gilles):
Increase SWIFT_RETRIES in Thumbor

https://gerrit.wikimedia.org/r/338957

Change 338957 merged by Filippo Giunchedi:
Increase SWIFT_RETRIES in Thumbor

https://gerrit.wikimedia.org/r/338957

@Gilles thanks for the investigation! I've merged https://gerrit.wikimedia.org/r/338957; let's see what happens over the next few hours/days.

Can you check if we have more than 2 pipes per instance at the moment?

@Gilles looks like it's a mixture of 2-4 pipes, I'm assuming during request processing, but no more than that:

# for i in $(pidof python) ; do lsof -p $i | grep -c ' pipe' ; done | tr $'\n' ' '
2 4 4 2 2 2 2 2 2 4 4 2 4 4 4 0 2 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 5 2 2 2 4 5 2 2 2 0
# for i in $(pidof python) ; do lsof -p $i | grep -c ' pipe' ; done | tr $'\n' ' '
5 2 2 2 4 2 2 0 2 2 2 2 2 5 2 4 4 2 2 2 2 4 2 2 2 5 2 2 2 2 2 2 2 2 5 2 2 2 4 2 0 2
Gilles closed this task as Resolved. · Feb 27 2017, 6:21 PM

In that case, I'll close this task for now, assuming that it was related to the Swift retry issue. I guess we'll find out whether there's still a real pipe leak when we manage to get the instances to run for much longer.