Page MenuHomePhabricator

Swift object servers become briefly unresponsive on a regular basis
Open, MediumPublic

Description

These errors happen thousands of times per day on a given swift proxy server:

Jun 18 06:28:15 ms-fe1005 proxy-server: ERROR with Object server 10.64.32.223:6000/sdf1 re: Trying to GET /v1/AUTH_mw/wikipedia-commons-local-thumb.f4/f/f4/Flickr_-_%E2%80%A6trialsanderrors_-_Marion_Davies%2C_Ziegfeld_girl%2C_by_Alfred_Cheney_Johnston%2C_1924.jpg/800px-Flickr_-_%E2%80%A6trialsanderrors_-_Marion_Davies%2C_Ziegfeld_girl%2C_by_Alfred_Cheney_Johnston%2C_1924.jpg: ConnectionTimeout (0.5s) (txn: tx5facf4161a494da2b2556-005d08847f) (client_ip: x.x.x.x)

They result in 503s experienced by users.

It appears that different swift object servers (probably all) enter a state where they have hanging connections like this for a short time, resulting in small bursts of such errors.

We could increase the timeout at the swift proxy level to see if helps, but it would be good to figure out what's causing swift object servers to behave like this.

Details

Related Gerrit Patches:

Event Timeline

Change 518658 had a related patch set uploaded (by Gilles; owner: Gilles):
[operations/puppet@production] Increase swift proxy connection timeout to 1s

https://gerrit.wikimedia.org/r/518658

jijiki triaged this task as High priority.Jun 24 2019, 2:34 PM
jijiki moved this task from Backlog to Doing on the serviceops board.

Change 518658 merged by Effie Mouzeli:
[operations/puppet@production] Increase swift proxy connection timeout to 1s

https://gerrit.wikimedia.org/r/518658

Mentioned in SAL (#wikimedia-operations) [2019-06-27T10:48:30Z] <jijiki> Rolling restart ms-fe* proxy services for T226373 and T211661

Error rate hasn't gone down at all, now we're just getting errors that time out at 1s instead of 0.5s...

Jun 27 11:00:05 ms-fe1005 proxy-server: ERROR with Object server 10.64.16.82:6000/sdf1 re: Trying to GET /v1/AUTH_mw/wikipedia-commons-local-thumb.c5/c/c5/Barbara_Buchholz_playing_TVox.jpg/640px-Barbara_Buchholz_playing_TVox.jpg: ConnectionTimeout (1.0s) (txn: tx45334c0157ae45d7b88f2-005d14a1b4) (client_ip: x.x.x.x)
jijiki moved this task from Doing to Backlog on the serviceops board.
Joe added a subscriber: Joe.Jun 27 2019, 12:00 PM

Do we have metrics on the swift backends open connections / connections queues? without such information, I don't think we can understand what the problem is, what is causing it, and how to mitigate it.

@Joe I will start a more thorough investigation the following days, we'll see what will come up

Error rate hasn't gone down at all, now we're just getting errors that time out at 1s instead of 0.5s...

Jun 27 11:00:05 ms-fe1005 proxy-server: ERROR with Object server 10.64.16.82:6000/sdf1 re: Trying to GET /v1/AUTH_mw/wikipedia-commons-local-thumb.c5/c/c5/Barbara_Buchholz_playing_TVox.jpg/640px-Barbara_Buchholz_playing_TVox.jpg: ConnectionTimeout (1.0s) (txn: tx45334c0157ae45d7b88f2-005d14a1b4) (client_ip: x.x.x.x)

Also I think all/most replicas of a single object should timeout for an error to be reported by the proxy, have you observed e.g. connectiontimeout for the same object three times in a row?

At a glance on a given proxy the same object doesn't occur multiple times in a row. But the same destination object server has timeouts for several objects in a row, over a period of a few seconds.

Since the 1s timeout change didn't seem to have changed things, could we revert it please?

Change 520727 had a related patch set uploaded (by Effie Mouzeli; owner: Effie Mouzeli):
[operations/puppet@production] Revert "Increase swift proxy connection timeout to 1s"

https://gerrit.wikimedia.org/r/520727

CDanis added a subscriber: CDanis.Jul 4 2019, 3:36 PM
jijiki moved this task from Backlog to Next up on the serviceops board.Jul 12 2019, 8:16 AM

Change 520727 merged by Effie Mouzeli:
[operations/puppet@production] Revert "Increase swift proxy connection timeout to 1s"

https://gerrit.wikimedia.org/r/520727

Is this still an issue?

Yes:

gilles@ms-fe1005:/var/log/swift$ cat server.log.1 | grep ConnectionTimeout | wc -l
2048
Gilles moved this task from Inbox to Radar on the Performance-Team board.Oct 15 2019, 7:58 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.
jijiki moved this task from Next up to Backlog on the serviceops board.Oct 21 2019, 9:41 AM

I've investigated a bit the scope and impact of this issue, namely by joining the transactions IDs for which swift reported ConnectionTimeout in server.log with swift proxy-access.log. The idea being to see what swift sent back to ATS and with which latency.

The ms-fe1005 proxy-access used contains entries from Dec 2 06:25:05 to Dec 4 06:25:05:

$ grep ConnectionTimeout server.log | awk '{print $19}' | sed 's/)$//' > txns
$ grep -F -f txns proxy-access.log > connectiontimeout-access.log
$ wc -l txns 
4293 txns
$ wc -l connectiontimeout-access.log 
4291 connectiontimeout-access.log
$ wc -l proxy-access.log
66745858

summarizing the connection timeout status codes:

$ head -1 connectiontimeout-access.log
Dec  2 06:25:09 ms-fe1005 proxy-server: 10.64.16.57 10.64.0.38 02/Dec/2019/06/25/09 GET /v1/AUTH_mw/wikipedia-commons-local-thumb.42/4/42/LE_SR-673.jpg/998px-LE_SR-673.jpg HTTP/1.0 404 - python-swiftclient-3.1.0 AUTH_XXX- 70 - tx8542563da70c4c6380daa-005de4ae44 - 0.5219 - - 1575267908.605798006 1575267909.127675056 0
$ cat connectiontimeout-access.log | awk '{print $12}'  | histogram.py --custom-buckets 200,300,400,500,600
# NumSamples = 4291; Min = 200.00; Max = 499.00
# Mean = 331.186903; Variance = 13614.458659; SD = 116.681012; Median 304.000000
# each ∎ represents a count of 25
  200.0000 -   200.0000 [  1509]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  200.0000 -   300.0000 [    62]: ∎∎
  300.0000 -   400.0000 [   774]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  400.0000 -   499.0000 [  1946]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎

So swift-proxy itself doesn't issue a 500, but logs a 499 which means the client disconnected before swift got a chance to reply. I haven't investigated though how ATS treats this condition and how it manifests to the end user (e.g. if retries are involved)

In terms of impact, the connection timeouts represent a very small fraction of all requests served by a single ms-fe host (0.006%)

re: ats and client timeouts and retries, yes ats does retry on origin timeout as it seems. Otherwise a 504 is returned to the user, for cache_upload there are indeed a few 504 on the backend but none on the tls frontend.

What is the right followup after a month? "I don't know" is an ok answer, I just want to clarify the status of the ticket- e.g. does ATS people need to be involved? Or just "being handled, just needs time"?

fgiunchedi lowered the priority of this task from High to Medium.Thu, Jan 2, 10:35 AM

What is the right followup after a month? "I don't know" is an ok answer, I just want to clarify the status of the ticket- e.g. does ATS people need to be involved? Or just "being handled, just needs time"?

Lowering to "normal" since the status is "known issue but not a whole lot of user-facing impact" and will need more investigation, pending resources