Understand and solve wide row issues for frequently edited and re-rendered pages
Closed, Duplicate · Public
Assigned To

None

Authored By

	• GWicke
	Mar 26 2015, 11:51 PM

Description

The 'Max' row of the 'Partition Size' column in the cfhistograms output shows an extremely large value (about 8.6 GB):

nodetool cfhistograms local_group_wikipedia_T_parsoid_html data
local_group_wikipedia_T_parsoid_html/data histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             2.00             35.00           4768.00             20501                 2
75%             2.00             60.00           8239.00             61214                 4
95%             4.00            215.00          14237.00            379022                20
98%             4.00            310.00          20501.00            785939                29
99%             4.00            446.00          29521.00           1358102                35
Min             0.00              3.00             30.00              1332                 2
Max             7.00          51012.00         454826.00        8582860529             14237

We should investigate whether that value is accurate and which partition it belongs to. It could be caused by a bug in RESTBase.

See: T107949

Related Objects

Status	Assigned	Task
Resolved	Eevans	T106619 investigate G1GC pause times
Duplicate	None	T94121 Understand and solve wide row issues for frequently edited and re-rendered pages
Declined	• mobrovac	T147366 Setup automated topk wide row reporting

Event Timeline

• GWicke created this task. Mar 26 2015, 11:51 PM

• GWicke raised the priority of this task to Needs Triage.

• GWicke updated the task description.

• GWicke added a project: RESTBase-Cassandra.

• GWicke added subscribers: GWicke, Eevans.

Restricted Application added a subscriber: Aklapper. Mar 26 2015, 11:51 PM

• GWicke updated the task description. Mar 26 2015, 11:52 PM

• GWicke set Security to None.

Eevans added a parent task: T106619: investigate G1GC pause times. Jul 28 2015, 9:30 PM

After thinning and a couple of bugfixes we still have some very large rows in the db:

nodetool -h restbase1005.eqiad.wmnet cfstats local_group_wikipedia_T_parsoid_html.data
...
Compacted partition maximum bytes: 44285675122

The same is reported by cfhistograms:

nodetool -h restbase1005.eqiad.wmnet cfhistograms local_group_wikipedia_T_parsoid_html data
local_group_wikipedia_T_parsoid_html/data histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             1.00             50.00           2759.00              8239                 2
75%             2.00            103.00           4768.00             20501                 2
95%             3.00            310.00          11864.00             88148                10
98%             4.00            446.00          20501.00            219342                17
99%             5.00            642.00          29521.00            454826                24
Min             0.00              3.00             15.00                51                 0
Max            50.00         219342.00       25109160.00       44285675122             61214

That's a 44G partition. Earlier cfstats showed >50G. Other nodes are showing rows around 12G, so this is not limited to one node.
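
For reference, a minimal sketch of the unit conversion behind the "44G" figure. The helper name `describe_bytes` is made up for illustration; the two byte counts are the 'Max' partition sizes quoted in this task. As far as I know these nodetool values come from a bucketed histogram, so they should be read as approximate upper bounds rather than exact sizes.

```python
def describe_bytes(n: int) -> str:
    """Render a byte count in decimal gigabytes and binary gibibytes."""
    gb = n / 10**9    # decimal gigabytes
    gib = n / 2**30   # binary gibibytes
    return f"{n} bytes = {gb:.1f} GB = {gib:.1f} GiB"


# 'Max' partition sizes reported by cfhistograms / cfstats in this task.
for label, size in [
    ("original report", 8582860529),
    ("restbase1005", 44285675122),
]:
    print(label, "->", describe_bytes(size))
```

The "44G" above is the decimal reading; in binary units the same partition is a little over 41 GiB.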

We recently fixed a couple of bugs that contributed to very large rows building up, and made some related changes:

- The retention policy limit bug (T105509) likely caused retention policy updates to fail completely once a row grew large enough for the limit-less select to time out.
- The data-parsoid rewrite bug (also in the retention policy code) bloated JSON values on rewrite by introducing many levels of escaping.
- Some RESTBase jobs seem to have been retried by the job queue since the beginning of RESTBase time (T73853). We have recently adjusted the config, which might help to stop that. This is scheduled to be deployed today.
- Back in April/May we implemented several optimizations (If-Unmodified-Since support, diffing) that reduced the number of new renders being saved: T93751.

Because large rows were failing earlier in the thin-out script, we should re-run the script to make sure that all wide rows are fully thinned out. See T105706 for a (non-exhaustive) list of problematic keys.

Cassandra 2.1.8 (released on July 9, 2015) logs partitions larger than a configured threshold during compaction. I think this is our best bet for tracking down these large partitions.

See: T107949

Eevans updated the task description. Aug 4 2015, 10:00 PM

Note: Once https://github.com/wikimedia/cassandra-metrics-collector/pull/1 has landed, we will be able to configure dashboards to track partition size and column count (see also: T101764 and T97024).

Here are the top 10 partitions, by machine (as seen since upgrading to 2.1.8 in T107949).

restbase1001.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_html	data	it.wikipedia.org:Utente\:Biobot/log	28703917016
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	ur.wikipedia.org:نام_مقامات_بی	10511454705
local_group_wikipedia_T_parsoid_html	data	ur.wikipedia.org:نام_مقامات_بی	8580788986
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	ur.wikipedia.org:نام_مقامات_سی	4844690460
local_group_wikipedia_T_parsoid_html	data	ru.wikipedia.org:Википедия\:Форум/Правила	2513578410
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	it.wikipedia.org:Utente\:Biobot/log	2337717280
local_group_wikipedia_T_parsoid_html	data	fr.wikipedia.org:Liste_des_walis_des_wilayas_algériennes	2133724898
local_group_wikipedia_T_parsoid_html	data	de.wikipedia.org:Benutzer\:Hans-Jürgen_Hübner	1868370679
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:User\:EranBot/Copyright/rc	1586112132
local_group_wikipedia_T_parsoid_html	data	hy.wikipedia.org:Վիքիպեդիա\:Նախագիծ\:Վիքիընդլայնում	1166074586

restbase1002.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	zh.wikipedia.org:User\:Cewbot/log/20150109	4793510676
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:2015_in_sports	3600773989
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Wikipedia\:Good_article_nominations	1828474829
local_group_wikipedia_T_parsoid_html	data	ru.wikipedia.org:Википедия\:Запросы_к_администраторам	1720871806
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	en.wikipedia.org:Wikipedia\:Administrators'_noticeboard/Incidents	1433064680
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:User\:DeltaQuad/UAA/Wait	1098413187
local_group_wikipedia_T_parsoid_section_offsets	data	zh.wikipedia.org:User\:Cewbot/log/20150109	1096252042
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:2015–16_UEFA_Europa_League_qualifying_phase_and_play-off_round	1048028657
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Wikipedia\:In_the_news/Candidates	970543948
local_group_wikipedia_T_parsoid_html	data	uk.wikipedia.org:Вікіпедія\:Кнайпа_(різне)	935036351

restbase1003.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	ur.wikipedia.org:نام_مقامات_بی	10516723068
local_group_wikipedia_T_parsoid_html	data	ur.wikipedia.org:نام_مقامات_بی	8580788986
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:User\:EranBot/Copyright/rc	1559688702
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	it.wikipedia.org:Utente\:Zabaleta5/Sandbox	1259342124
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:User\:DeltaQuad/UAA/Wait	1098060948
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:User\:Rankersbo/CSD_log	1072308058
local_group_wikipedia_T_parsoid_html	data	it.wikipedia.org:Utente\:Zabaleta5/Sandbox	993234104
local_group_wikipedia_T_parsoid_html	data	zh.wikipedia.org:Wikipedia\:关注度/提报	932057835
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:User\:JamesR/AdminStats	878101807
local_group_wikipedia_T_parsoid_html	data	pl.wikipedia.org:Wikipedia\:Prośby_o_przejrzenie_edycji	851791725

restbase1004.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_html	data	it.wikipedia.org:Utente\:Biobot/log	28703744616
local_group_wikipedia_T_parsoid_html	data	zh.wikipedia.org:User\:Cewbot/log/20150109	7572976785
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	ur.wikipedia.org:نام_مقامات_سی	4827257639
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	zh.wikipedia.org:User\:Cewbot/log/20150109	4793510676
local_group_wikipedia_T_parsoid_html	data	zh.wikipedia.org:Wikipedia\:新条目推荐/候选	4089410626
local_group_wikipedia_T_parsoid_html	data	ur.wikipedia.org:نام_مقامات_سی	4009550454
local_group_wikipedia_T_title__revisions	idx_by_ns_ever	en.wikipedia.org:0	2810698543
local_group_wikipedia_T_parsoid_html	data	ru.wikipedia.org:Википедия\:Форум/Правила	2504260141
local_group_wikipedia_T_parsoid_html	data	hu.wikipedia.org:Szerkesztő\:DVTK_KIADÓ/Új_cikkek	1554323145
local_group_wikipedia_T_parsoid_html	data	ru.wikipedia.org:Список_кораблей_Военно-морского_флота_Российской_Федерации	1500485420

restbase1005.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	it.wikipedia.org:Utente\:Biobot/log	28226232375
local_group_wikipedia_T_parsoid_html	data	it.wikipedia.org:Utente\:Biobot/log	24686510754
local_group_wikipedia_T_parsoid_html	data	ur.wikipedia.org:نام_مقامات_بی	8580788986
local_group_wikipedia_T_parsoid_html	data	ru.wikipedia.org:Википедия\:Форум/Правила	2504260141
local_group_wikipedia_T_parsoid_html	data	fr.wikipedia.org:Liste_des_walis_des_wilayas_algériennes	2062112493
local_group_wikipedia_T_parsoid_html	data	ur.wikipedia.org:نام_مقامات_اے	1658481110
local_group_wikimedia_T_parsoid_html	data	commons.wikimedia.org:Commons\:WikiProject_Aviation/recent_uploads/2014_March_29	1296703023
local_group_wikipedia_T_parsoid_section_offsets	data	zh.wikipedia.org:User\:Cewbot/log/20150109	1096252042
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	fr.wikipedia.org:Liste_des_walis_des_wilayas_algériennes	1024824908
local_group_wikipedia_T_parsoid_html	data	ru.wikipedia.org:Проект\:Знаете_ли_вы/Подготовка_следующего_выпуска	872969665

restbase1006.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Wikipedia\:Administrators'_noticeboard/Incidents	8735116469
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:2015_in_sports	3600773989
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	en.wikipedia.org:User\:JamesR/AdminStats	2623564374
local_group_wikipedia_T_title__revisions	idx_by_ns_ever	en.wikipedia.org:0	1655098018
local_group_wikipedia_T_parsoid_html	data	hy.wikipedia.org:Վիքիպեդիա\:Նախագիծ\:Վիքիընդլայնում	1140213871
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	en.wikipedia.org:Wikipedia\:Administrators'_noticeboard/Incidents	1125052734
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:User\:Rankersbo/CSD_log	1094847666
local_group_wikipedia_T_parsoid_html	data	fr.wikipedia.org:Discussion\:Algérie_française	1089868913
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:2015–16_UEFA_Europa_League_qualifying_phase_and_play-off_round	1048028657
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Template_talk\:Did_you_know	1035177290

restbase1007.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_html	data	ur.wikipedia.org:نام_مقامات_اے	1546373999
local_group_wikimedia_T_parsoid_html	data	commons.wikimedia.org:Commons\:WikiProject_Aviation/recent_uploads/2014_March_29	1296703023
local_group_wikipedia_T_parsoid_html	data	it.wikipedia.org:Utente\:Zabaleta5/Sandbox	1002476185
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Islamic_State_of_Iraq_and_the_Levant	975818734
local_group_wikipedia_T_parsoid_html	data	es.wikipedia.org:Club_Atlético_River_Plate	968332643
local_group_default_T_parsoid_dataW4ULtxs1oMqJeY	data	www.wikidata.org:User\:Liangent-bot/cleanupilh-wb-report/zhwiki	947876073
local_group_wikipedia_T_parsoid_html	data	pl.wikipedia.org:Wikipedia\:Prośby_o_przejrzenie_edycji	889715109
local_group_wikipedia_T_parsoid_html	data	fr.wikipedia.org:Wikipédia\:Le_saviez-vous_?/Anecdotes_proposées	857789028
local_group_default_T_parsoid_html	data	www.wikidata.org:User\:Liangent-bot/cleanupilh-wb-report/zhwiki	850190869
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Brazil_at_the_2015_Pan_American_Games	839486711

restbase1008.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:2015_in_sports	3600794283
local_group_wikipedia_T_parsoid_html	data	ru.wikipedia.org:Википедия\:Запросы_к_администраторам	1671010077
local_group_wikipedia_T_parsoid_html	data	de.wikipedia.org:Wikipedia\:Auskunft	1326977223
local_group_default_T_parsoid_dataW4ULtxs1oMqJeY	data	www.wikidata.org:Wikidata\:The_Game/Flagged_items	1247500769
local_group_wikipedia_T_parsoid_html	data	fr.wikipedia.org:Discussion\:Algérie_française	1074096695
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:2015–16_UEFA_Europa_League_qualifying_phase_and_play-off_round	1045680695
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Wikipedia\:In_the_news/Candidates	972909848
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Argentina_at_the_2015_Pan_American_Games	919980898
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	de.wikipedia.org:Wikipedia\:Auskunft	905483109
local_group_wiktionary_T_parsoid_html	data	en.wiktionary.org:Wiktionary\:Requests_for_verification	889189977

restbase1009.eqiad.wmnet

keyspace	cf	key	size
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	ur.wikipedia.org:نام_مقامات_سی	4835976337
local_group_wikimedia_T_parsoid_html	data	commons.wikimedia.org:Commons\:Quality_images_candidates/candidate_list	3272261467
local_group_wikipedia_T_parsoid_html	data	de.wikipedia.org:Wikipedia\:Auskunft	2860481698
local_group_wikipedia_T_parsoid_html	data	ru.wikipedia.org:Википедия\:Запросы_к_администраторам	1723227242
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ	data	de.wikipedia.org:Wikipedia\:Auskunft	1575267394
local_group_default_T_parsoid_dataW4ULtxs1oMqJeY	data	www.wikidata.org:Wikidata\:The_Game/Flagged_items	1248684400
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:User\:DeltaQuad/UAA/Wait	1094117910
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Islamic_State_of_Iraq_and_the_Levant	978054350
local_group_default_T_parsoid_html	data	www.wikidata.org:Wikidata\:The_Game/Flagged_items	872100526
local_group_wikipedia_T_parsoid_html	data	en.wikipedia.org:Brazil_at_the_2015_Pan_American_Games	839486711

One-liner to list the large partitions recently compacted on a node, sorted by size (largest last):

grep 'Compacting large' /var/log/cassandra/system.log | awk '{print $11, $10}' | sed -e 's/^(//' | sort -n
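
A rough Python equivalent of the one-liner, in case the per-node lists need to be deduplicated and cut to a top k. This is a sketch under assumptions: the message shape `Compacting large partition <keyspace>/<cf>:<key> (<N> bytes)` is inferred from the grep pattern and awk field positions above, and the function name `top_partitions`, the regex, and the sample lines (including their prefix) are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed message shape: "Compacting large partition ks/cf:key (N bytes)".
LARGE_PARTITION = re.compile(
    r"Compacting large partition "
    r"(?P<ks>[^/\s]+)/(?P<cf>[^:\s]+):(?P<key>.+?) "
    r"\((?P<size>\d+) bytes\)"
)


def top_partitions(lines: Iterable[str], k: int = 10) -> List[Tuple[str, str, str, int]]:
    """Return the k largest partitions seen, keeping the max size per key."""
    largest: Dict[Tuple[str, str, str], int] = {}
    for line in lines:
        match = LARGE_PARTITION.search(line)
        if match is None:
            continue  # not a large-partition warning
        ident = (match["ks"], match["cf"], match["key"])
        largest[ident] = max(largest.get(ident, 0), int(match["size"]))
    ranked = sorted(largest.items(), key=lambda item: item[1], reverse=True)
    return [(ks, cf, key, size) for (ks, cf, key), size in ranked[:k]]


# Hypothetical log lines; keys and sizes are taken from the tables above.
sample = [
    "WARN [example] Compacting large partition "
    "local_group_wikipedia_T_parsoid_html/data:en.wikipedia.org:2015_in_sports (3600773989 bytes)",
    "WARN [example] Compacting large partition "
    "local_group_wikipedia_T_parsoid_html/data:it.wikipedia.org:Utente\\:Biobot/log (28703917016 bytes)",
    "INFO [example] some unrelated message",
]
for row in top_partitions(sample, k=2):
    print("\t".join(str(col) for col in row))
```

Reading a real log would be `top_partitions(open("/var/log/cassandra/system.log"))`, assuming the same path as in the one-liner.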

Looking at the biggest pages in this list, most of them are huge log pages (ex: http://it.wikipedia.org/wiki/Utente:Biobot/log) that are edited once a minute or so by a bot. This means that our retention policy is doing its job well, and the reason for these huge pages is simply many revisions. DTCS should reduce the impact of these crazy pages by breaking up these monster partitions into time slices.

• GWicke mentioned this in T120971: Blacklist automatic updates for especially expensive pages. Dec 9 2015, 5:32 PM

As a stop-gap, we have now blacklisted the most problematic pages from job queue updates. More thorough work to support large pages more efficiently will be happening in T120171.

Restricted Application added a subscriber: StudiesWorld. Dec 11 2015, 1:57 AM

• GWicke mentioned this in T122028: RFC: Chunked storage algorithms for archival data vs. large-window brotli compression. Feb 4 2016, 11:04 PM

fgiunchedi mentioned this in T126221: Evaluate efficacy of DateTieredCompactionStrategy. Feb 8 2016, 6:43 PM

I investigated this a bit this weekend. With a few tens of thousands of revisions per title, I saw a significant performance drop right after writes, but almost all of this drop disappeared (for implicit and explicit requests for the latest revision) after a full compaction.

One factor to keep in mind is compression metadata. Whenever Cassandra is asked to read a revision that's not explicitly the latest, it needs to read & fully decode the compression offset index for the partition. For very wide rows, this takes time linear in the partition size. See this older article from Andrew Morton.
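
To give a feel for the "linear in the partition size" point, here is a back-of-the-envelope sketch. It assumes a 64 KiB compression chunk length (which I believe is the default `chunk_length_kb` in this Cassandra version), one offset entry per chunk, and that the reported partition size is the uncompressed size. The function name `chunk_count` is made up for illustration.

```python
def chunk_count(partition_bytes: int, chunk_length_kb: int = 64) -> int:
    """Estimate how many compression chunks a partition spans (ceiling division)."""
    chunk_bytes = chunk_length_kb * 1024
    return -(-partition_bytes // chunk_bytes)


# Median and maximum partition sizes from the restbase1005 cfhistograms output.
for label, size in [("50th percentile", 8239), ("max", 44285675122)]:
    print(f"{label}: {size} bytes -> about {chunk_count(size)} chunk(s)")
```

Under these assumptions a median partition fits in a single chunk, while the largest one spans several hundred thousand, so any per-chunk work grows by the same factor.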

Eevans added a project: Cassandra. Apr 29 2016, 8:38 PM

Eevans closed this task as a duplicate of T143056: Address abnormally wide partitions. Sep 20 2016, 8:33 PM

Let's use this task to continue the discussion about possible solutions.

• GWicke renamed this task from "Investigate huge max partition size in cfhistograms output" to "Solve wide row issues for frequently edited and re-rendered pages". Sep 27 2016, 3:58 PM

Eevans mentioned this in T146902: Cassandra outage: restbase1009-a.eqiad.wmnet. Sep 29 2016, 8:45 PM

Eevans merged a task: T143056: Address abnormally wide partitions. Oct 4 2016, 9:12 PM

Eevans mentioned this in T147366: Setup automated topk wide row reporting. Oct 4 2016, 9:24 PM

Eevans created subtask T147366: Setup automated topk wide row reporting.

• GWicke triaged this task as High priority. Oct 12 2016, 4:45 PM

• GWicke added a project: Services (next).

• GWicke mentioned this in T120171: RFC: Differentiate storage strategies for archival storage vs. hot current data. Oct 12 2016, 8:07 PM

To expand on my summary in T94121#2010218, another factor that makes wide partitions problematic is Cassandra's inability to rule out SSTables based on range keys. In RESTBase terms, this affects any query for "latest" information, which means any request that does not explicitly supply all parts of the range key (ex: revision *and* tid).

The main summary information Cassandra has about keys per SSTable is a bloom filter on the partition key (title). For a page with many revisions, and when using a time- or size-based compaction strategy (anything but leveled), this bloom filter will indicate a match for most SSTables. Without any additional information about key ranges, Cassandra can't rule out that any of these SSTables contains a higher range key (revision or tid), so it needs to hit each of them to establish which one has the max. As described in T94121#2010218, this involves looking at compression metadata and retrieving the data itself.

Things are a lot better for exact match queries (all parts of range key fixed). In this case, Cassandra can rule out other sstables based on timestamp ranges once it has found an exact match. Unfortunately, most of our use cases require range queries (max typically).
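
The difference between the two query shapes can be illustrated with a toy model. This is only a sketch of the reasoning above, not of Cassandra's actual read path: the `SSTable` class, the exact (false-positive-free) stand-in for the bloom filter, and both function names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SSTable:
    name: str
    # (title, revision) -> write timestamp of that cell
    rows: Dict[Tuple[str, int], int]

    def may_contain(self, title: str) -> bool:
        """Stand-in for the bloom filter on the partition key."""
        return any(t == title for t, _rev in self.rows)

    @property
    def max_timestamp(self) -> int:
        return max(self.rows.values())


def tables_read_for_latest(sstables: List[SSTable], title: str) -> List[str]:
    """A 'latest' query must consult every SSTable that may hold the title."""
    return [s.name for s in sstables if s.may_contain(title)]


def tables_read_for_exact(sstables: List[SSTable], title: str, revision: int) -> List[str]:
    """An exact match lets older SSTables be skipped by timestamp range."""
    read: List[str] = []
    found_ts: Optional[int] = None
    for s in sorted(sstables, key=lambda s: s.max_timestamp, reverse=True):
        if not s.may_contain(title):
            continue
        if found_ts is not None and s.max_timestamp < found_ts:
            continue  # everything in this SSTable is older than the match
        read.append(s.name)
        ts = s.rows.get((title, revision))
        if ts is not None and (found_ts is None or ts > found_ts):
            found_ts = ts
    return read


tables = [
    SSTable("s1", {("Busy_page", 1): 10, ("Busy_page", 2): 20}),
    SSTable("s2", {("Busy_page", 3): 30, ("Busy_page", 4): 40}),
    SSTable("s3", {("Busy_page", 5): 50, ("Busy_page", 6): 60}),
    SSTable("s4", {("Quiet_page", 1): 55}),
]
print(tables_read_for_latest(tables, "Busy_page"))    # every SSTable holding the title
print(tables_read_for_exact(tables, "Busy_page", 6))  # stops after the newest SSTable
```

In this model the "latest" query touches all three SSTables that hold the busy title, while the exact query for revision 6 touches only the newest one.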

So, what can we do about this? Some options, some of which could be combined:

1) Go back to leveled compaction. Leveled splits SSTables by partition key within each level, which means that the number of SSTables containing a given partition key is typically smaller than with any other compaction strategy. The cost is higher compaction load / write amplification, especially with large instances.
2) Use smaller instances (more Cassandra instances, ScyllaDB). This reduces the number of SSTables containing a given partition by reducing the total number of SSTables. Can be combined with 1). In combination, leveled compaction might also become feasible again.
3) Improve Cassandra by introducing some efficient range key summary. This summary would provide min & possibly max values per partition key, which in turn would let Cassandra rule out SSTables for range queries based on this information.
4) Work around the issue by partitioning ranges by time or revision range.
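
A minimal in-memory sketch of the last option (partitioning by time range), to make the idea concrete. Everything here is hypothetical: the `BucketedStore` class, the 30-day bucket width, and the per-title bucket index are illustrative choices, not a proposed RESTBase schema. It also assumes revisions increase with time, so the newest bucket holds the highest revision.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple

BUCKET_SECONDS = 30 * 24 * 3600  # hypothetical bucket width: 30 days


class BucketedStore:
    """Toy store whose partition key is (title, time bucket) instead of title."""

    def __init__(self) -> None:
        self.partitions: Dict[Tuple[str, int], Dict[int, str]] = defaultdict(dict)
        self.buckets: Dict[str, Set[int]] = defaultdict(set)  # title -> known buckets

    def put(self, title: str, revision: int, timestamp: int, html: str) -> None:
        bucket = timestamp // BUCKET_SECONDS
        self.partitions[(title, bucket)][revision] = html
        self.buckets[title].add(bucket)

    def latest(self, title: str) -> Optional[Tuple[int, str]]:
        """Look only at the newest bucket rather than the whole history."""
        known = self.buckets.get(title)
        if not known:
            return None
        newest = self.partitions[(title, max(known))]
        revision = max(newest)
        return revision, newest[revision]

    def widest_partition(self) -> int:
        return max((len(p) for p in self.partitions.values()), default=0)


store = BucketedStore()
for i in range(1000):  # one render per day for a frequently edited page
    store.put("Busy_page", revision=i, timestamp=i * 86400, html=f"html-{i}")

print(store.latest("Busy_page"))
print(len(store.partitions), "partitions, widest holds", store.widest_partition(), "revisions")
```

The trade-off is that a "latest" lookup now needs to know which bucket is newest (tracked here by the `buckets` index), and queries spanning history have to visit several partitions.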

• GWicke renamed this task from "Solve wide row issues for frequently edited and re-rendered pages" to "Understand and solve wide row issues for frequently edited and re-rendered pages". Oct 12 2016, 8:50 PM

Eevans mentioned this in T149203: RESTBase-Cassandra wide partitions: period of 2016-10-17 to 2016-10-24. Oct 26 2016, 3:33 PM

Eevans mentioned this in T149572: RESTBase-Cassandra wide partitions: period of 2016-10-24 to 2016-10-31. Oct 31 2016, 2:32 PM

Eevans mentioned this in T150700: RESTBase-Cassandra wide partitions: period of 2016-11-06 to 2016-11-13. Nov 14 2016, 9:18 PM

Eevans mentioned this in T151771: Cassandra node outages; OutOfMemoryError exceptions. Nov 28 2016, 5:19 PM

Eevans moved this task from Backlog to In-Progress on the Cassandra board. Nov 29 2016, 9:38 PM

Eevans closed this task as a duplicate of T144431: RESTBase k-r-v as Cassandra anti-pattern.

• GWicke mentioned this in T144431: RESTBase k-r-v as Cassandra anti-pattern. Nov 30 2016, 10:19 PM

• GWicke mentioned this in T152724: Current state and next steps for RESTBase storage. Dec 8 2016, 9:32 PM

• GWicke mentioned this in T150811: Evaluate ScyllaDB as a near-term replacement to Cassandra. Dec 8 2016, 11:34 PM

• GWicke mentioned this in T153703: Option: Consider switching back to leveled compaction (LCS). Dec 19 2016, 7:00 PM

• mobrovac closed subtask T147366: Setup automated topk wide row reporting as Declined. Mar 14 2019, 12:47 AM

Legoktm mentioned this in T274359: Mobile REST API delivers year old+ content for very select pages. Nov 28 2021, 12:29 AM