
Understand and solve wide row issues for frequently edited and re-rendered pages
Closed, Duplicate · Public


The 'max' value reported for 'Partition size' by cfhistograms is extremely large:

nodetool cfhistograms local_group_wikipedia_T_parsoid_html data
local_group_wikipedia_T_parsoid_html/data histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             2.00             35.00           4768.00             20501                 2
75%             2.00             60.00           8239.00             61214                 4
95%             4.00            215.00          14237.00            379022                20
98%             4.00            310.00          20501.00            785939                29
99%             4.00            446.00          29521.00           1358102                35
Min             0.00              3.00             30.00              1332                 2
Max             7.00          51012.00         454826.00        8582860529             14237

We should perhaps investigate whether that figure is accurate, and which partition it is. It could also be a bug in RESTBase.

See: T107949


Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke added a project: RESTBase-Cassandra.
GWicke added subscribers: GWicke, Eevans.

After thinning and a couple of bugfixes we still have some very large rows in the db:

nodetool -h restbase1005.eqiad.wmnet cfstats
Compacted partition maximum bytes: 44285675122

The same is reported by cfhistograms:

nodetool -h restbase1005.eqiad.wmnet cfhistograms local_group_wikipedia_T_parsoid_html data
local_group_wikipedia_T_parsoid_html/data histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             1.00             50.00           2759.00              8239                 2
75%             2.00            103.00           4768.00             20501                 2
95%             3.00            310.00          11864.00             88148                10
98%             4.00            446.00          20501.00            219342                17
99%             5.00            642.00          29521.00            454826                24
Min             0.00              3.00             15.00                51                 0
Max            50.00         219342.00       25109160.00       44285675122             61214

That's a 44G partition. Earlier cfstats showed >50G. Other nodes are showing rows around 12G, so this is not limited to one node.

We did fix a couple of bugs recently that contributed to very large rows building up:

  • the retention policy limit bug (T105509) likely caused retention policy updates to fail completely once a row reached a size large enough to cause the limit-less select to time out
  • the data-parsoid rewrite bug (also in the retention policy code) bloated JSON values on rewrite by introducing many levels of escaping
  • some restbase jobs seem to have been retried by the job queue since the beginning of RESTBase time (T73853). We have recently adjusted the config, which might help to stop that. This is scheduled to be deployed today.
  • back in April/May we implemented several optimizations (If-Unmodified-Since support, diffing) that reduced the number of new renders being saved: T93751

As large rows were failing earlier in the thin-out script, we should re-run it to make sure that all wide rows are fully thinned out. See T105706 for a (non-exhaustive) list of problematic keys.

Cassandra 2.1.8 (released on July 9) adds logging, during compaction, of partitions larger than a configured threshold. I think this is our best bet for tracking down these large partitions.
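If the option in question is the large-partition warning threshold, enabling it would look something like this in cassandra.yaml (option name and default are my assumption; worth double-checking against the 2.1.8 release notes):

```yaml
# cassandra.yaml -- log a warning at compaction time for any partition
# larger than this threshold, in MB (assumed option name and default).
compaction_large_partition_warning_threshold_mb: 100
```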

See: T107949

Note: Once that change has landed, we will be able to configure dashboards to track partition size and column count (see also: T101764 and T97024).

Here are the top 10 partitions, by machine (as seen since upgrading to 2.1.8 in T107949).

[per-node top-10 partition tables not preserved in this export]

One-liner to print the biggest recent compactions on a node:

grep 'Compacting large' /var/log/cassandra/system.log | awk '{print $11, $10}' | sed -e 's/^(//' | sort -n
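As a sanity check, the pipeline above can be exercised on a synthetic log line (the 2.1.8 message format and the resulting awk field positions are assumptions on my part):

```shell
# Hypothetical system.log line in the assumed 2.1.8 format; awk fields
# $10 and $11 are the partition key and "(<bytes>" respectively, so the
# pipeline prints "<bytes> <keyspace/table:key>".
line='WARN  [CompactionExecutor:4] 2015-08-11 12:00:00,000 SSTableWriter.java:240 - Compacting large partition local_group_wikipedia_T_parsoid_html/data:Some_page (8582860529 bytes)'
echo "$line" | awk '{print $11, $10}' | sed -e 's/^(//'
# -> 8582860529 local_group_wikipedia_T_parsoid_html/data:Some_page
```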

Looking at the biggest pages in this list, most of them are huge log pages that are edited once a minute or so by a bot. This means that our retention policy is doing its job well, and the reason for these huge pages is simply many revisions. DTCS should reduce the impact by breaking up these monster partitions into time slices.

As a stop-gap, we have now blacklisted the most problematic pages from job queue updates. More thorough work to support large pages more efficiently will be happening in T120171.

I investigated this a bit this weekend. With a few tens of thousands of revisions per title, I saw a significant performance drop right after writes, but almost all of it disappeared (for both implicit and explicit requests for the latest revision) after a full compaction.

One factor to keep in mind is compression metadata. Whenever Cassandra is asked to read a revision that's not explicitly the latest, it needs to read & fully decode the compression offset index for the partition. For very wide rows, this takes time linear in the partition size. See this older article from Andrew Morton.
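For a sense of scale, here is a back-of-envelope calculation for the 44G partition above, assuming the default 64 KB compression chunk size and 8 bytes per chunk offset (both assumptions):

```shell
# Rough size of the compression offset index Cassandra must decode to
# read from this partition (chunk size and offset width are assumed).
partition_bytes=44285675122
chunk_bytes=$((64 * 1024))
chunks=$(( (partition_bytes + chunk_bytes - 1) / chunk_bytes ))
echo "chunks=${chunks} offset_index=$(( chunks * 8 / 1024 / 1024 )) MiB"
# -> chunks=675746 offset_index=5 MiB
```

Several megabytes of metadata to decode per read is consistent with the per-read cost growing linearly with partition size.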

Let's use this task to continue the discussion about possible solutions.

GWicke renamed this task from "Investigate huge max partition size in cfhistograms output" to "Solve wide row issues for frequently edited and re-rendered pages". Sep 27 2016, 3:58 PM
GWicke added a project: Services (next).

To expand on my summary in T94121#2010218, another factor that makes wide partitions problematic is Cassandra's inability to rule out SSTables based on range keys. In RESTBase terms, this affects any query for "latest" information, which means any request that does not explicitly supply all parts of the range key (ex: revision *and* tid).

The main summary information Cassandra keeps about keys per SSTable is a bloom filter on the partition key (title). For a page with many revisions, and when using a time- or size-based compaction strategy (anything but leveled), this bloom filter will indicate a match for most SSTables. Without any additional information about key ranges, Cassandra can't rule out that any of these SSTables contains a higher range key (revision or tid), so it needs to hit each of them to establish which one has the max. As described in T94121#2010218, this involves looking at compression metadata and retrieving the data itself.

Things are a lot better for exact match queries (all parts of range key fixed). In this case, Cassandra can rule out other sstables based on timestamp ranges once it has found an exact match. Unfortunately, most of our use cases require range queries (max typically).

So, what can we do about this? Some options, some of which could be combined:

  1. Go back to leveled compaction. Leveled splits sstables by partition key within each level, which means that the number of sstables containing a given partition key is typically smaller than using any other compaction strategy. The cost is higher compaction load / write amplification, especially with large instances.
  2. Use smaller instances (more Cassandra instances, ScyllaDB). This reduces the number of sstables containing a given partition by reducing the total number of sstables. Can be combined with 1). In combination, leveled compaction might also become feasible again.
  3. Improve Cassandra by introducing some efficient range key summary. This summary would provide min & possibly max values per partition key, which in turn would let Cassandra rule out SSTables for range queries based on this information.
  4. Work around the issue by partitioning ranges by time or revision range.
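Option 4 could look roughly like the following sketch (table and column names here are made up for illustration, not the actual RESTBase schema):

```sql
-- Hypothetical CQL sketch of option 4: fold a coarse time bucket into
-- the partition key so each partition covers a bounded window.
CREATE TABLE html_by_bucket (
    key    text,       -- page title
    bucket int,        -- e.g. year-month of the write, 201610
    rev    int,
    tid    timeuuid,
    value  blob,
    PRIMARY KEY ((key, bucket), rev, tid)
) WITH CLUSTERING ORDER BY (rev DESC, tid DESC);
```

A "latest" read would query the current bucket first and fall back to older buckets until a row is found, trading an occasional extra query for bounded partition sizes.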
GWicke renamed this task from "Solve wide row issues for frequently edited and re-rendered pages" to "Understand and solve wide row issues for frequently edited and re-rendered pages". Oct 12 2016, 8:50 PM