
swift-account-stats failures on thanos-swift
Closed, Invalid (Public)

Description

The swift-account-stats cron has been failing on thanos-fe1001 with the error below, causing cronspam along with missing data points in account stats.

Traceback (most recent call last):
  File "/usr/local/bin/swift-account-stats", line 76, in <module>
    sys.exit(main())
  File "/usr/local/bin/swift-account-stats", line 49, in main
    headers = connection.head_account()
  File "/usr/lib/python3/dist-packages/swiftclient/client.py", line 1847, in head_account
    return self._retry(None, head_account, headers=headers)
  File "/usr/lib/python3/dist-packages/swiftclient/client.py", line 1801, in _retry
    rv = func(self.url, self.token, *args,
  File "/usr/lib/python3/dist-packages/swiftclient/client.py", line 919, in head_account
    raise ClientException.from_response(resp, 'Account HEAD failed', body)
swiftclient.exceptions.ClientException: Account HEAD failed: https://thanos-swift.discovery.wmnet/v1/AUTH_thanos 401 Unauthorized

The accounts most frequently involved are thanos, chartmuseum and tegola, which makes me think this might be a load-related issue (other accounts don't see nearly the same levels of traffic).

Event Timeline

Mentioned in SAL (#wikimedia-operations) [2022-05-09T12:19:12Z] <godog> depool thanos-fe1001 to test load theory wrt account-stats failures - T307907

FWIW, we do occasionally see this on ms-* too, but I can never repro it on demand, which might support a load-related cause; I never found much in the logs. Could we maybe make swift-account-stats retry a couple of times?

We have already been debugging this a bit.

When manually running the command I always get stats back. Even in a for-loop with a 5-second sleep I could not reproduce it.

Jesse could reproduce it every Nth time when being more aggressive with sleep 1 or so.

@jhathaway

I did notice that there are a lot of crons all running at * * * * * and not randomized, so perhaps these sometimes get rate-limited.

All of this did NOT explain why we see it only on one server and not all of them.

P.S. These are still crons and not systemd timers, but T273673 says we should convert them all, so maybe this is a good time to do that. As a side effect we could then "systemctl start" them a bunch of times, as opposed to manually running the cron command.
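If the unsplayed per-minute crons mentioned above really are tripping rate limits, one low-effort mitigation until the T273673 timer conversion (where systemd's RandomizedDelaySec would do this natively) would be a small random splay before each run. A hypothetical sketch only; the helper name and delay range are made up:

import random
import time

def run_with_splay(job, max_delay=30):
    # Sleep a random 0..max_delay seconds before running the job, so the
    # once-a-minute stats jobs don't all hit the frontends at the same instant.
    time.sleep(random.uniform(0, max_delay))
    return job()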

> P.S. These are still crons and not systemd timers, but T273673 says we should convert them all, so maybe this is a good time to do that. As a side effect we could then "systemctl start" them a bunch of times, as opposed to manually running the cron command.

See T288806 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/778485.

Thank you folks for taking a look! I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/778485 (thanks @Zabe!) so at least the cronspam will stop; of course the root cause is still TBD.

> FWIW, we do occasionally see this on ms-* too, but I can never repro it on demand, which might support a load-related cause; I never found much in the logs. Could we maybe make swift-account-stats retry a couple of times?

AFAIK python-swiftclient should retry at least once (I've observed the retry with swift stat --debug when trying to reproduce). I likewise didn't find much in the logs so far.
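For reference, a minimal sketch of making that client-side retry behaviour explicit via python-swiftclient's Connection(retries=...); the auth URL and credentials below are placeholders, not the values swift-account-stats actually uses:

from swiftclient.client import Connection

# Placeholder endpoint/credentials; the real script reads these from its config.
conn = Connection(
    authurl='https://thanos-swift.discovery.wmnet/auth/v1.0',  # assumed v1 auth URL
    user='thanos:stats',                                       # placeholder account:user
    key='REDACTED',
    retries=5,  # explicit client-side retries on failed requests
)
headers = conn.head_account()  # the call that 401s in the traceback above
print(headers.get('x-account-object-count'), headers.get('x-account-bytes-used'))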

> We have already been debugging this a bit.
>
> When manually running the command I always get stats back. Even in a for-loop with a 5-second sleep I could not reproduce it.
>
> Jesse could reproduce it every Nth time when being more aggressive with sleep 1 or so.
>
> @jhathaway
>
> I did notice that there are a lot of crons all running at * * * * * and not randomized, so perhaps these sometimes get rate-limited.
>
> All of this did NOT explain why we see it only on one server and not all of them.

This is because these crons are singletons wrt the swift (and thanos) fleet, i.e. they only run from one host at a time. It is possible there's some rate limiting involved, although so far my hunch is some kind of timeout, either internally in swift and/or in memcache (where the tokens are written/read).
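If a cached token disappearing from memcache is indeed the cause, one option would be for swift-account-stats to catch the 401, drop the cached token and retry once so the client re-authenticates. A hypothetical sketch of that idea (the helper name is made up; clearing conn.url/conn.token should make python-swiftclient fetch a fresh token on the next call):

from swiftclient.client import Connection
from swiftclient.exceptions import ClientException

def head_account_with_reauth(conn: Connection):
    # Retry an account HEAD once with a freshly fetched token if the cached one gets a 401.
    try:
        return conn.head_account()
    except ClientException as exc:
        if exc.http_status != 401:
            raise
        # Drop the cached endpoint/token so the next call re-authenticates instead of
        # reusing a token that may have expired or been evicted from memcache.
        conn.url = None
        conn.token = None
        return conn.head_account()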

Marostegui triaged this task as Medium priority. May 17 2022, 8:18 AM

I can't find any more errors for now, so I'm tentatively and optimistically resolving as invalid; will reopen if issues pop up again.