Many error 500 from pageviews API "Error in Cassandra table storage backend"
Closed, Duplicate · Public

Description

When running a rather small treeviews query like https://tools.wmflabs.org/glamtools/treeviews/?q=%7B%22lang%22%3A%22it%22%2C%22pagepile%22%3A%22492%22%2C%22rows%22%3A%5B%5D%7D (435 articles), logs show hundreds of requests getting an error 500 like:

{"type":"https://restbase.org/errors/https://restbase.org/errors/query_error","title":"Error in Cassandra table storage backend","method":"get","uri":"/analytics.wikimedia.org/v1/pageviews/per-article/it.wikipedia/all-access/user/Ferdinando_di_Diano/daily/20151201/20151231"}

On first attempt, I got 36 % of the data; on second attempt, 84 %; on 3rd, 97 %; on 4th, 100 %. Presumably the API needs to handle concurrent requests better with cold caches.
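Until that happens, a client can soften the effect by retrying failed requests with increasing delays. The following is only a minimal sketch of that idea, not the treeviews code: `fetch_with_retry`, the `fetch` callable (which is assumed to return a `(status, body)` pair) and the delay values are all hypothetical.

```python
import time

# Statuses worth retrying: throttling (429) and transient server errors.
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_retry(fetch, url, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fetch(url) until it returns a non-retryable status.

    `fetch` is a hypothetical callable returning (status, body).
    Delays double after each failed attempt: 0.5 s, 1 s, 2 s, ...
    """
    status, body = fetch(url)
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE:
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch(url)
    return status, body
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting; a real client would also want to cap the number of requests in flight at once.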

Not to be confused with incorrect 404: T134964: Invalid API input returns 404 instead of 500 or 400.

Related Objects

Status      Assigned      Task
Resolved    Ottomata
Resolved    RobH
Duplicate   mobrovac
Resolved    Milimetric
Resolved    Milimetric
Resolved    Ottomata
Resolved    mobrovac
Resolved    JAllemandou
Resolved    RobH
Resolved    JAllemandou
Resolved    elukey
Resolved    elukey
Resolved    elukey
Resolved    elukey
Resolved    Nuria
Resolved    JAllemandou
Resolved    Nuria
Resolved    JAllemandou
Duplicate   JAllemandou
Resolved    elukey
Resolved    mobrovac

Event Timeline

Nemo_bis created this task. Jan 31 2016, 5:42 PM
Nemo_bis updated the task description.
Nemo_bis raised the priority of this task from to Normal.
Nemo_bis added a subscriber: Nemo_bis.
Restricted Application added a subscriber: Aklapper. Jan 31 2016, 5:42 PM
Nemo_bis set Security to None. Jan 31 2016, 5:46 PM
Nemo_bis added a subscriber: GWicke.
GWicke added a comment. Edited Feb 1 2016, 11:56 PM

This is caused by high iowait on the cassandra cluster. See T116097 and T124947.

This is caused by high iowait on the cassandra cluster. See T116097 and T124947.

Thanks. I'm not entirely convinced though: even if you increase the maximum throughput, there will always be a limit. When concurrency passes that limit, the API should be able to return something other than error 500.

Thanks. I'm not entirely convinced though: even if you increase the maximum throughput, there will always be a limit.

Cassandra performs much much better with SSDs, so the situation you witnessed should be brought down to a minimum after the disk upgrade. You are right that there will always be a limit, but with SSDs we should have enough headroom so that we don't need to optimise prematurely for extreme cases.

When concurrency passes that limit, the API should be able to return something other than error 500.

This problem should disappear soon, as we plan to implement rate limiting in the main RESTBase cluster, cf. T125123: Add rate limiter functionality to service-runner

mobrovac closed this task as Resolved. Feb 3 2016, 1:52 AM
mobrovac claimed this task.

The situation seems to have calmed down in the last 12h, so resolving.

Hm, yes, the example URL produces no error 500 for me now (I don't know whether the treeviews code changed). Nice! Either way, I linked this report to some users so that they can report if they see it again.

Nemo_bis updated the task description. May 11 2016, 6:10 AM

I still don't understand why it returns 500. An appropriate status code is 429.

I still don't understand why it returns 500. An appropriate status code is 429.

That was done in T135240.

Nemo_bis reopened this task as Open. Jun 5 2016, 7:06 AM

Better now, but still not fixed: with the example URL (and cold cache for it)

  • in Chromium 51 I got multiple 500 before getting any 429 (and the browser doesn't retry, should be reported),
  • in Firefox 46, with a category of similar size, I didn't manage to get any 429 but I got multiple 500:

Thanks for the bug report, looks like that 10 req/s limit is too high. For now we might not get to tuning it because we're busy trying to just upgrade it to be faster in the first place. But if the situation persists for too long, we'll throttle further.

Milimetric moved this task from Incoming to Backlog (Later) on the Analytics board. Jun 6 2016, 4:43 PM
Milimetric moved this task from Backlog (Later) to Dashiki on the Analytics board.
Milimetric raised the priority of this task from Normal to High.
Nuria added a subscriber: Nuria. Jul 4 2016, 4:37 PM

This issue will be fixed with the new Cassandra cluster we are working on; with our current cluster, the behaviour you see is the best we can do. Closing this ticket as a duplicate of the main one: https://phabricator.wikimedia.org/T124314

This issue will be fixed with the new Cassandra cluster we are working on; with our current cluster, the behaviour you see is the best we can do. Closing this ticket as a duplicate of the main one: https://phabricator.wikimedia.org/T124314

What's the timescale for this fix? I'm still not convinced that increasing capacity will solve the problem:

even if you increase the maximum throughput, there will always be a limit. When concurrency passes that limit, the API should be able to return something other than error 500.

Well, we are throttling and returning a 429 (too many requests) when we see access above a certain limit. It's just been hard to predict where that limit is for the old cluster because it's so overworked with spinning disks. We rushed it so we could get people pageview data faster; it was a conscious decision, and I think most people think it was a good one.

The new cluster is online and we're loading it with data right now. We're talking lots of terabytes that need to get compacted in Cassandra, so it'll take a while, and we can't estimate perfectly, but most likely pretty soon.

Once that's online, we'll load test it and figure out more precisely where its limits are, and throw 429 errors more predictably.
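For readers unfamiliar with this kind of throttling, here is a minimal token-bucket sketch of the general idea: refuse with a 429 once requests arrive faster than a configured rate, instead of passing them to an overloaded backend. It is an illustration only, not the RESTBase or service-runner implementation; `TokenBucket`, `handle` and the rate and capacity values are hypothetical.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        # Refill in proportion to the time elapsed, never above capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle(bucket, query):
    """Return (status, body): 429 when throttled, otherwise the query result."""
    if not bucket.allow():
        return 429, "Too Many Requests"
    return 200, query()
```

The hard part discussed in this task is not the mechanism but choosing `rate`: it has to sit below what the storage backend can actually sustain, which is what load testing would establish.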

Akeron added a subscriber: Akeron. Sep 5 2016, 12:36 PM
Nuria added a comment. Edited Sep 5 2016, 8:43 PM

What's the timescale for this fix? I'm still not convinced that increasing capacity will solve the problem:

The ETA for having the new cluster in service is the end of September. The throughput and throttling thresholds in the new cluster will be higher, which should make this error less frequent.

elukey added a subscriber: elukey. Sep 6 2016, 8:08 AM

Adding a bit more info about the new cluster for @Nemo_bis. The new hosts' SSDs allowed us to load Cassandra with different settings, such as Leveled compaction rather than DTCS, which should bring better performance and stability. At the moment the major cause of the 500s is timeouts due to long disk seeks, which slow Cassandra down.

We tried various workarounds on the current AQS cluster to improve its performance, but we kept hitting the same host limits. If you want more data, I added some thoughts in https://phabricator.wikimedia.org/T143873

Feel free to follow up with us on IRC if you have more questions :)