
swift-account-stats failures on thanos-swift
Closed, Invalid (Public)

Description

The swift-account-stats cron has been failing on thanos-fe1001 with the error below, causing cronspam along with missing data points in account stats.

Traceback (most recent call last):
  File "/usr/local/bin/swift-account-stats", line 76, in <module>
    sys.exit(main())
  File "/usr/local/bin/swift-account-stats", line 49, in main
    headers = connection.head_account()
  File "/usr/lib/python3/dist-packages/swiftclient/client.py", line 1847, in head_account
    return self._retry(None, head_account, headers=headers)
  File "/usr/lib/python3/dist-packages/swiftclient/client.py", line 1801, in _retry
    rv = func(self.url, self.token, *args,
  File "/usr/lib/python3/dist-packages/swiftclient/client.py", line 919, in head_account
    raise ClientException.from_response(resp, 'Account HEAD failed', body)
swiftclient.exceptions.ClientException: Account HEAD failed: https://thanos-swift.discovery.wmnet/v1/AUTH_thanos 401 Unauthorized

The accounts most frequently involved are thanos, chartmuseum and tegola, which makes me think this might be a load-related issue (other accounts don't see nearly the same levels of traffic).

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-05-09T12:19:12Z] <godog> depool thanos-fe1001 to test load theory wrt account-stats failures - T307907

FWIW, we do occasionally see this on ms-* too, but I can never repro it on demand, which might support a load-related cause; I never found much in the logs. Could we maybe make swift-account-stats retry a couple of times?

We have already been debugging this a bit.

When manually running the command I always get stats back. Even in a for-loop with a 5-second sleep I could not reproduce it.

Jesse could reproduce it every Nth time when being more aggressive with sleep 1 or so.

@jhathaway

I did notice that there are a lot of crons all running at * * * * * and not randomized, so perhaps these sometimes get rate-limited.

All of this did NOT explain why we see it only on one server and not all of them.

P.S. These are still crons and not systemd timers, but T273673 says we should convert them all, so maybe this is a good time to do that. As a side effect we could then "systemctl start" them a bunch of times, as opposed to manually running the cron command.
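If the unsplayed per-minute crons mentioned above really are tripping rate limits, one low-effort mitigation until the T273673 timer conversion (where systemd's RandomizedDelaySec would do this natively) would be a small random splay before each run. A hypothetical sketch only; the helper name and delay range are made up:

import random
import time

def run_with_splay(job, max_delay=30):
    # Sleep a random 0..max_delay seconds before running the job, so the
    # once-a-minute stats jobs don't all hit the frontends at the same instant.
    time.sleep(random.uniform(0, max_delay))
    return job()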

> P.S. These are still crons and not systemd timers, but T273673 says we should convert them all, so maybe this is a good time to do that. As a side effect we could then "systemctl start" them a bunch of times, as opposed to manually running the cron command.

See T288806 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/778485.

Thank you folks for taking a look! I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/778485 (thanks @Zabe!) so at least the cronspam will stop; of course the root cause is still TBD.

> FWIW, we do occasionally see this on ms-* too, but I can never repro it on demand, which might support a load-related cause; I never found much in the logs. Could we maybe make swift-account-stats retry a couple of times?

AFAIK python-swiftclient should retry at least once (I've observed the retry with swift stat --debug when trying to reproduce). I likewise didn't find much in the logs so far.
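For reference, a minimal sketch of making that client-side retry behaviour explicit via python-swiftclient's Connection(retries=...); the auth URL and credentials below are placeholders, not the values swift-account-stats actually uses:

from swiftclient.client import Connection

# Placeholder endpoint/credentials; the real script reads these from its config.
conn = Connection(
    authurl='https://thanos-swift.discovery.wmnet/auth/v1.0',  # assumed v1 auth URL
    user='thanos:stats',                                       # placeholder account:user
    key='REDACTED',
    retries=5,  # explicit client-side retries on failed requests
)
headers = conn.head_account()  # the call that 401s in the traceback above
print(headers.get('x-account-object-count'), headers.get('x-account-bytes-used'))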

> We have already been debugging this a bit.
>
> When manually running the command I always get stats back. Even in a for-loop with a 5-second sleep I could not reproduce it.
>
> Jesse could reproduce it every Nth time when being more aggressive with sleep 1 or so.
>
> @jhathaway
>
> I did notice that there are a lot of crons all running at * * * * * and not randomized, so perhaps these sometimes get rate-limited.
>
> All of this did NOT explain why we see it only on one server and not all of them.

This is because these crons are singletons wrt the swift (and thanos) fleet, i.e. they only run from one host at a time. It is possible there's some rate limiting involved, although so far my hunch is some kind of timeout, either internally in swift and/or in memcache (where the tokens are written/read).
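If a cached token disappearing from memcache is indeed the cause, one option would be for swift-account-stats to catch the 401, drop the cached token and retry once so the client re-authenticates. A hypothetical sketch of that idea (the helper name is made up; clearing conn.url/conn.token should make python-swiftclient fetch a fresh token on the next call):

from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

def head_account_with_reauth(conn: Connection):
    # Retry an account HEAD once with a freshly fetched token if the cached one gets a 401.
    try:
        return conn.head_account()
    except ClientException as exc:
        if exc.http_status != 401:
            raise
        # Drop the cached endpoint/token so the next call re-authenticates instead of
        # reusing a token that may have expired or been evicted from memcache.
        conn.url = None
        conn.token = None
        return conn.head_account()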

Marostegui triaged this task as Medium priority. May 17 2022, 8:18 AM

I can't find any more errors for now, so I'm tentatively and optimistically resolving as invalid; will reopen if issues pop up again.