Swift object servers become briefly unresponsive on a regular basis
Open, High, Public

Description

These errors happen thousands of times per day on a given swift proxy server:

Jun 18 06:28:15 ms-fe1005 proxy-server: ERROR with Object server 10.64.32.223:6000/sdf1 re: Trying to GET /v1/AUTH_mw/wikipedia-commons-local-thumb.f4/f/f4/Flickr_-_%E2%80%A6trialsanderrors_-_Marion_Davies%2C_Ziegfeld_girl%2C_by_Alfred_Cheney_Johnston%2C_1924.jpg/800px-Flickr_-_%E2%80%A6trialsanderrors_-_Marion_Davies%2C_Ziegfeld_girl%2C_by_Alfred_Cheney_Johnston%2C_1924.jpg: ConnectionTimeout (0.5s) (txn: tx5facf4161a494da2b2556-005d08847f) (client_ip: x.x.x.x)

They result in 503s experienced by users.

It appears that different swift object servers (probably all) enter a state where they have hanging connections like this for a short time, resulting in small bursts of such errors.

We could increase the connection timeout at the swift proxy level to see if it helps, but it would be good to figure out what's causing swift object servers to behave like this.

Event Timeline

Change 518658 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Increase swift proxy connection timeout to 1s

https://gerrit.wikimedia.org/r/518658

jijiki triaged this task as High priority. Jun 24 2019, 2:34 PM
jijiki moved this task from Backlog to Doing on the serviceops board.

Change 518658 merged by Effie Mouzeli:
[operations/puppet@production] Increase swift proxy connection timeout to 1s

https://gerrit.wikimedia.org/r/518658

Mentioned in SAL (#wikimedia-operations) [2019-06-27T10:48:30Z] <jijiki> Rolling restart ms-fe* proxy services for T226373 and T211661

Error rate hasn't gone down at all, now we're just getting errors that time out at 1s instead of 0.5s...

Jun 27 11:00:05 ms-fe1005 proxy-server: ERROR with Object server 10.64.16.82:6000/sdf1 re: Trying to GET /v1/AUTH_mw/wikipedia-commons-local-thumb.c5/c/c5/Barbara_Buchholz_playing_TVox.jpg/640px-Barbara_Buchholz_playing_TVox.jpg: ConnectionTimeout (1.0s) (txn: tx45334c0157ae45d7b88f2-005d14a1b4) (client_ip: x.x.x.x)
jijiki moved this task from Doing to Backlog on the serviceops board.
Joe added a subscriber: Joe. Jun 27 2019, 12:00 PM

Do we have metrics on the swift backends' open connections / connection queues? Without such information, I don't think we can understand what the problem is, what is causing it, or how to mitigate it.
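As a rough illustration of the kind of metric asked about here, the sketch below counts established connections and the pending accept queue for the object server port by parsing the text of /proc/net/tcp. It is only a sketch under assumptions: the function name `socket_stats` is made up for this example, port 6000 is taken from the log lines in this task, and it relies on my understanding that Linux reports the current accept-queue depth in the rx_queue column for sockets in LISTEN state (the same number `ss -l` shows as Recv-Q), which should be verified on the target kernel.

```python
def socket_stats(proc_net_tcp_text, port=6000):
    """Summarise sockets bound to `port` from the text of /proc/net/tcp.

    Assumption: for LISTEN sockets (state 0A) the rx_queue column holds
    the number of connections waiting to be accepted.
    """
    accept_queue = 0
    established = 0
    # The first line of /proc/net/tcp is a column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue
        # local_address looks like "DF20400A:1770" (hex address : hex port).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port != port:
            continue
        state = fields[3]
        if state == "0A":  # TCP_LISTEN
            # Column 4 is "tx_queue:rx_queue", both hexadecimal.
            accept_queue += int(fields[4].split(":")[1], 16)
        elif state == "01":  # TCP_ESTABLISHED
            established += 1
    return {"accept_queue": accept_queue, "established": established}
```

On an object server this could be sampled every second with something like `socket_stats(open("/proc/net/tcp").read())` and the two numbers sent to whatever metrics pipeline is in use; an accept queue that fills up during the error bursts would point at the object server workers not accepting connections fast enough.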

@Joe I will start a more thorough investigation over the following days; we'll see what comes up.

Error rate hasn't gone down at all, now we're just getting errors that time out at 1s instead of 0.5s...

Jun 27 11:00:05 ms-fe1005 proxy-server: ERROR with Object server 10.64.16.82:6000/sdf1 re: Trying to GET /v1/AUTH_mw/wikipedia-commons-local-thumb.c5/c/c5/Barbara_Buchholz_playing_TVox.jpg/640px-Barbara_Buchholz_playing_TVox.jpg: ConnectionTimeout (1.0s) (txn: tx45334c0157ae45d7b88f2-005d14a1b4) (client_ip: x.x.x.x)

Also, I think all/most replicas of a single object would have to time out for an error to be reported by the proxy. Have you observed e.g. ConnectionTimeout for the same object three times in a row?

At a glance on a given proxy the same object doesn't occur multiple times in a row. But the same destination object server has timeouts for several objects in a row, over a period of a few seconds.
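To go beyond an at-a-glance check, the observation above can be tested against a full proxy log. The following is a minimal sketch, not something that was run as part of this task: it parses ConnectionTimeout lines in the format quoted earlier, groups them by destination object server, and reports runs of errors that are close together in time. The function names, the 5 second gap, the minimum of 3 errors per burst, and the default year (syslog timestamps carry none) are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the proxy-server error lines quoted in this task.
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ proxy-server: "
    r"ERROR with Object server (?P<server>[\d.]+:\d+)/(?P<device>\S+) re: "
    r"Trying to GET (?P<path>\S+): ConnectionTimeout \((?P<timeout>[\d.]+)s\)"
)


def parse_timeouts(lines, year=2019):
    """Yield (timestamp, server) for every ConnectionTimeout line."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            # Syslog timestamps have no year, so one is supplied explicitly.
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            yield ts, m["server"]


def find_bursts(events, max_gap_s=5, min_errors=3):
    """Return {server: [(first_ts, last_ts, count), ...]}.

    A burst is a run of errors for one server in which consecutive
    errors are at most max_gap_s seconds apart.
    """
    by_server = defaultdict(list)
    for ts, server in events:
        by_server[server].append(ts)

    bursts = {}
    for server, stamps in by_server.items():
        stamps.sort()
        runs, run = [], [stamps[0]]
        for ts in stamps[1:]:
            if (ts - run[-1]).total_seconds() <= max_gap_s:
                run.append(ts)
            else:
                runs.append(run)
                run = [ts]
        runs.append(run)
        found = [(r[0], r[-1], len(r)) for r in runs if len(r) >= min_errors]
        if found:
            bursts[server] = found
    return bursts
```

Feeding it an open log file, e.g. `find_bursts(parse_timeouts(open(path)))` with `path` pointing at the proxy's log, would list which object servers had clustered timeouts and when, which could then be lined up against metrics from those hosts.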

Since the 1s timeout change didn't seem to have changed things, could we revert it please?

Change 520727 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] Revert "Increase swift proxy connection timeout to 1s"

https://gerrit.wikimedia.org/r/520727

CDanis added a subscriber: CDanis. Jul 4 2019, 3:36 PM
jijiki moved this task from Backlog to Next up on the serviceops board. Jul 12 2019, 8:16 AM

Change 520727 merged by Effie Mouzeli:
[operations/puppet@production] Revert "Increase swift proxy connection timeout to 1s"

https://gerrit.wikimedia.org/r/520727