Understand and solve wide row issues for frequently edited and re-rendered pages
Closed, Duplicate (Public)

Description

There is an extremely large number in the 'max' for 'Partition size' of cfhistograms:

nodetool cfhistograms local_group_wikipedia_T_parsoid_html data
local_group_wikipedia_T_parsoid_html/data histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             2.00             35.00           4768.00             20501                 2
75%             2.00             60.00           8239.00             61214                 4
95%             4.00            215.00          14237.00            379022                20
98%             4.00            310.00          20501.00            785939                29
99%             4.00            446.00          29521.00           1358102                35
Min             0.00              3.00             30.00              1332                 2
Max             7.00          51012.00         454826.00        8582860529             14237

We should perhaps investigate if that is accurate, and which partition this is. It could be a bug in restbase.

See: T107949

Event Timeline

GWicke raised the priority of this task to Needs Triage.
GWicke updated the task description.
GWicke added a project: RESTBase-Cassandra.
GWicke added subscribers: GWicke, Eevans.

After thinning and a couple of bugfixes we still have some very large rows in the db:

nodetool -h restbase1005.eqiad.wmnet cfstats local_group_wikipedia_T_parsoid_html.data
...
Compacted partition maximum bytes: 44285675122

The same is reported by cfhistograms:

nodetool -h restbase1005.eqiad.wmnet cfhistograms local_group_wikipedia_T_parsoid_html data
local_group_wikipedia_T_parsoid_html/data histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)                  
50%             1.00             50.00           2759.00              8239                 2
75%             2.00            103.00           4768.00             20501                 2
95%             3.00            310.00          11864.00             88148                10
98%             4.00            446.00          20501.00            219342                17
99%             5.00            642.00          29521.00            454826                24
Min             0.00              3.00             15.00                51                 0
Max            50.00         219342.00       25109160.00       44285675122             61214

That's a 44G partition. Earlier cfstats showed >50G. Other nodes are showing rows around 12G, so this is not limited to one node.

We recently fixed a couple of bugs, and made some related changes, that address causes of very large rows building up:

  • the retention policy limit bug (T105509) likely caused retention policy updates to fail completely once a row reached a size large enough to cause the limit-less select to time out
  • the data-parsoid rewrite bug (also in the retention policy code) bloated JSON values on rewrite by introducing many levels of escaping
  • some restbase jobs seem to have been retried by the job queue since the beginning of RESTBase time (T73853). We have recently adjusted the config, which might help to stop that. This is scheduled to be deployed today.
  • back in April/May we implemented several optimizations (If-Unmodified-Since support, diffing) that reduced the number of new renders being saved: T93751

As large rows were failing earlier in the thin-out script, we should re-run the script to make sure that all wide rows are fully thinned out. See T105706 for a (non-exhaustive) list of problematic keys.

Cassandra 2.1.8 (released on July 9) adds logging of partitions larger than a configured value, during compaction. I think this is our best bet in tracking down these large partitions.

See: T107949

Note: Once https://github.com/wikimedia/cassandra-metrics-collector/pull/1 has landed, we will be able to configure dashboards to track partition size and column count (see also: T101764 and T97024).

Here are the top 10 partitions, by machine (as seen since upgrading to 2.1.8 in T107949).

restbase1001.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_html  data  it.wikipedia.org:Utente\:Biobot/log  28703917016
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  ur.wikipedia.org:نام_مقامات_بی  10511454705
local_group_wikipedia_T_parsoid_html  data  ur.wikipedia.org:نام_مقامات_بی  8580788986
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  ur.wikipedia.org:نام_مقامات_سی  4844690460
local_group_wikipedia_T_parsoid_html  data  ru.wikipedia.org:Википедия\:Форум/Правила  2513578410
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  it.wikipedia.org:Utente\:Biobot/log  2337717280
local_group_wikipedia_T_parsoid_html  data  fr.wikipedia.org:Liste_des_walis_des_wilayas_algériennes  2133724898
local_group_wikipedia_T_parsoid_html  data  de.wikipedia.org:Benutzer\:Hans-Jürgen_Hübner  1868370679
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:User\:EranBot/Copyright/rc  1586112132
local_group_wikipedia_T_parsoid_html  data  hy.wikipedia.org:Վիքիպեդիա\:Նախագիծ\:Վիքիընդլայնում  1166074586

restbase1002.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  zh.wikipedia.org:User\:Cewbot/log/20150109  4793510676
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:2015_in_sports  3600773989
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Wikipedia\:Good_article_nominations  1828474829
local_group_wikipedia_T_parsoid_html  data  ru.wikipedia.org:Википедия\:Запросы_к_администраторам  1720871806
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  en.wikipedia.org:Wikipedia\:Administrators'_noticeboard/Incidents  1433064680
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:User\:DeltaQuad/UAA/Wait  1098413187
local_group_wikipedia_T_parsoid_section_offsets  data  zh.wikipedia.org:User\:Cewbot/log/20150109  1096252042
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:2015–16_UEFA_Europa_League_qualifying_phase_and_play-off_round  1048028657
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Wikipedia\:In_the_news/Candidates  970543948
local_group_wikipedia_T_parsoid_html  data  uk.wikipedia.org:Вікіпедія\:Кнайпа_(різне)  935036351

restbase1003.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  ur.wikipedia.org:نام_مقامات_بی  10516723068
local_group_wikipedia_T_parsoid_html  data  ur.wikipedia.org:نام_مقامات_بی  8580788986
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:User\:EranBot/Copyright/rc  1559688702
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  it.wikipedia.org:Utente\:Zabaleta5/Sandbox  1259342124
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:User\:DeltaQuad/UAA/Wait  1098060948
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:User\:Rankersbo/CSD_log  1072308058
local_group_wikipedia_T_parsoid_html  data  it.wikipedia.org:Utente\:Zabaleta5/Sandbox  993234104
local_group_wikipedia_T_parsoid_html  data  zh.wikipedia.org:Wikipedia\:关注度/提报  932057835
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:User\:JamesR/AdminStats  878101807
local_group_wikipedia_T_parsoid_html  data  pl.wikipedia.org:Wikipedia\:Prośby_o_przejrzenie_edycji  851791725

restbase1004.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_html  data  it.wikipedia.org:Utente\:Biobot/log  28703744616
local_group_wikipedia_T_parsoid_html  data  zh.wikipedia.org:User\:Cewbot/log/20150109  7572976785
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  ur.wikipedia.org:نام_مقامات_سی  4827257639
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  zh.wikipedia.org:User\:Cewbot/log/20150109  4793510676
local_group_wikipedia_T_parsoid_html  data  zh.wikipedia.org:Wikipedia\:新条目推荐/候选  4089410626
local_group_wikipedia_T_parsoid_html  data  ur.wikipedia.org:نام_مقامات_سی  4009550454
local_group_wikipedia_T_title__revisions  idx_by_ns_ever  en.wikipedia.org:0  2810698543
local_group_wikipedia_T_parsoid_html  data  ru.wikipedia.org:Википедия\:Форум/Правила  2504260141
local_group_wikipedia_T_parsoid_html  data  hu.wikipedia.org:Szerkesztő\:DVTK_KIADÓ/Új_cikkek  1554323145
local_group_wikipedia_T_parsoid_html  data  ru.wikipedia.org:Список_кораблей_Военно-морского_флота_Российской_Федерации  1500485420

restbase1005.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  it.wikipedia.org:Utente\:Biobot/log  28226232375
local_group_wikipedia_T_parsoid_html  data  it.wikipedia.org:Utente\:Biobot/log  24686510754
local_group_wikipedia_T_parsoid_html  data  ur.wikipedia.org:نام_مقامات_بی  8580788986
local_group_wikipedia_T_parsoid_html  data  ru.wikipedia.org:Википедия\:Форум/Правила  2504260141
local_group_wikipedia_T_parsoid_html  data  fr.wikipedia.org:Liste_des_walis_des_wilayas_algériennes  2062112493
local_group_wikipedia_T_parsoid_html  data  ur.wikipedia.org:نام_مقامات_اے  1658481110
local_group_wikimedia_T_parsoid_html  data  commons.wikimedia.org:Commons\:WikiProject_Aviation/recent_uploads/2014_March_29  1296703023
local_group_wikipedia_T_parsoid_section_offsets  data  zh.wikipedia.org:User\:Cewbot/log/20150109  1096252042
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  fr.wikipedia.org:Liste_des_walis_des_wilayas_algériennes  1024824908
local_group_wikipedia_T_parsoid_html  data  ru.wikipedia.org:Проект\:Знаете_ли_вы/Подготовка_следующего_выпуска  872969665

restbase1006.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Wikipedia\:Administrators'_noticeboard/Incidents  8735116469
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:2015_in_sports  3600773989
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  en.wikipedia.org:User\:JamesR/AdminStats  2623564374
local_group_wikipedia_T_title__revisions  idx_by_ns_ever  en.wikipedia.org:0  1655098018
local_group_wikipedia_T_parsoid_html  data  hy.wikipedia.org:Վիքիպեդիա\:Նախագիծ\:Վիքիընդլայնում  1140213871
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  en.wikipedia.org:Wikipedia\:Administrators'_noticeboard/Incidents  1125052734
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:User\:Rankersbo/CSD_log  1094847666
local_group_wikipedia_T_parsoid_html  data  fr.wikipedia.org:Discussion\:Algérie_française  1089868913
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:2015–16_UEFA_Europa_League_qualifying_phase_and_play-off_round  1048028657
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Template_talk\:Did_you_know  1035177290

restbase1007.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_html  data  ur.wikipedia.org:نام_مقامات_اے  1546373999
local_group_wikimedia_T_parsoid_html  data  commons.wikimedia.org:Commons\:WikiProject_Aviation/recent_uploads/2014_March_29  1296703023
local_group_wikipedia_T_parsoid_html  data  it.wikipedia.org:Utente\:Zabaleta5/Sandbox  1002476185
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Islamic_State_of_Iraq_and_the_Levant  975818734
local_group_wikipedia_T_parsoid_html  data  es.wikipedia.org:Club_Atlético_River_Plate  968332643
local_group_default_T_parsoid_dataW4ULtxs1oMqJeY  data  www.wikidata.org:User\:Liangent-bot/cleanupilh-wb-report/zhwiki  947876073
local_group_wikipedia_T_parsoid_html  data  pl.wikipedia.org:Wikipedia\:Prośby_o_przejrzenie_edycji  889715109
local_group_wikipedia_T_parsoid_html  data  fr.wikipedia.org:Wikipédia\:Le_saviez-vous_?/Anecdotes_proposées  857789028
local_group_default_T_parsoid_html  data  www.wikidata.org:User\:Liangent-bot/cleanupilh-wb-report/zhwiki  850190869
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Brazil_at_the_2015_Pan_American_Games  839486711

restbase1008.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:2015_in_sports  3600794283
local_group_wikipedia_T_parsoid_html  data  ru.wikipedia.org:Википедия\:Запросы_к_администраторам  1671010077
local_group_wikipedia_T_parsoid_html  data  de.wikipedia.org:Wikipedia\:Auskunft  1326977223
local_group_default_T_parsoid_dataW4ULtxs1oMqJeY  data  www.wikidata.org:Wikidata\:The_Game/Flagged_items  1247500769
local_group_wikipedia_T_parsoid_html  data  fr.wikipedia.org:Discussion\:Algérie_française  1074096695
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:2015–16_UEFA_Europa_League_qualifying_phase_and_play-off_round  1045680695
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Wikipedia\:In_the_news/Candidates  972909848
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Argentina_at_the_2015_Pan_American_Games  919980898
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  de.wikipedia.org:Wikipedia\:Auskunft  905483109
local_group_wiktionary_T_parsoid_html  data  en.wiktionary.org:Wiktionary\:Requests_for_verification  889189977

restbase1009.eqiad.wmnet

keyspace  cf  key  size (bytes)
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  ur.wikipedia.org:نام_مقامات_سی  4835976337
local_group_wikimedia_T_parsoid_html  data  commons.wikimedia.org:Commons\:Quality_images_candidates/candidate_list  3272261467
local_group_wikipedia_T_parsoid_html  data  de.wikipedia.org:Wikipedia\:Auskunft  2860481698
local_group_wikipedia_T_parsoid_html  data  ru.wikipedia.org:Википедия\:Запросы_к_администраторам  1723227242
local_group_wikipedia_T_parsoid_dataW4ULtxs1oMqJ  data  de.wikipedia.org:Wikipedia\:Auskunft  1575267394
local_group_default_T_parsoid_dataW4ULtxs1oMqJeY  data  www.wikidata.org:Wikidata\:The_Game/Flagged_items  1248684400
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:User\:DeltaQuad/UAA/Wait  1094117910
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Islamic_State_of_Iraq_and_the_Levant  978054350
local_group_default_T_parsoid_html  data  www.wikidata.org:Wikidata\:The_Game/Flagged_items  872100526
local_group_wikipedia_T_parsoid_html  data  en.wikipedia.org:Brazil_at_the_2015_Pan_American_Games  839486711

One-liner to print the biggest recent compactions on a node:

grep 'Compacting large' /var/log/cassandra/system.log | awk '{print $11, $10}' | sed -e 's/^(//' | sort -n
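
For anything beyond a quick look, the same extraction could be sketched in Python, which keeps keyspace, column family and key as separate fields and de-duplicates repeated compactions of the same partition. The message shape matched below ("Compacting large partition <keyspace>/<cf>:<key> (<n> bytes)") is an assumption inferred from the field positions the awk one-liner relies on, not something confirmed against the Cassandra source. The names parse_large_partitions and top_partitions are made up for this sketch.

```python
import re
import sys
from typing import Dict, Iterable, List, NamedTuple, Tuple


class LargePartition(NamedTuple):
    keyspace: str
    cf: str
    key: str
    size: int  # bytes


# Assumed message shape (inferred from the awk fields $10 and $11):
#   ... Compacting large partition <keyspace>/<cf>:<key> (<n> bytes)
LARGE_PARTITION_RE = re.compile(
    r"Compacting large partition "
    r"(?P<keyspace>[^/\s]+)/(?P<cf>[^:\s]+):(?P<key>\S+) "
    r"\((?P<size>\d+) bytes\)"
)


def parse_large_partitions(lines: Iterable[str]) -> List[LargePartition]:
    """Extract one record per matching log line; other lines are ignored."""
    found = []
    for line in lines:
        m = LARGE_PARTITION_RE.search(line)
        if m:
            found.append(
                LargePartition(m["keyspace"], m["cf"], m["key"], int(m["size"]))
            )
    return found


def top_partitions(lines: Iterable[str], n: int = 10) -> List[LargePartition]:
    """Return the n largest partitions, keeping the largest size seen per key."""
    largest: Dict[Tuple[str, str, str], LargePartition] = {}
    for p in parse_large_partitions(lines):
        ident = (p.keyspace, p.cf, p.key)
        if ident not in largest or p.size > largest[ident].size:
            largest[ident] = p
    return sorted(largest.values(), key=lambda p: p.size, reverse=True)[:n]


if __name__ == "__main__":
    # Default path taken from the one-liner above.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"
    with open(path, encoding="utf-8", errors="replace") as log:
        for p in top_partitions(log):
            print(p.size, p.keyspace, p.cf, p.key)
```

Unlike the one-liner, this prints the largest partitions first and lists each partition only once per node.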

Looking at the biggest pages in this list, most of them are huge log pages (ex: http://it.wikipedia.org/wiki/Utente:Biobot/log) that are edited once a minute or so by a bot. This means that our retention policy is doing its job well, and the reason for these huge pages is simply many revisions. DTCS should reduce the impact of these crazy pages by breaking up these monster partitions into time slices.

As a stop-gap, we have now blacklisted the most problematic pages from job queue updates. More thorough work to support large pages more efficiently will be happening in T120171.

I investigated this a bit this weekend. With a few tens of thousands of revisions per title, I saw a significant performance drop right after writes, but almost all of this drop disappeared (for implicit and explicit requests for the latest revision) after a full compaction.

One factor to keep in mind is compression metadata. Whenever Cassandra is asked to read a revision that's not explicitly the latest, it needs to read & fully decode the compression offset index for the partition. For very wide rows, this takes time linear in the partition size. See this older article from Andrew Morton.
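
To get a feel for the scale, here is a rough back-of-the-envelope sketch of how many compression chunk offsets a single partition can span. It assumes one offset per compressed chunk, a chunk length of 64 KiB, and 8 bytes per offset; all three are assumptions for illustration and may not match the settings used on the RESTBase cluster. The partition sizes are taken from the cfhistograms output earlier in this task.

```python
def chunk_count(partition_bytes: int, chunk_length_kb: int = 64) -> int:
    """Number of compression chunks a partition of this size spans (rounded up)."""
    chunk_bytes = chunk_length_kb * 1024
    return -(-partition_bytes // chunk_bytes)  # integer ceiling division


def offset_index_bytes(
    partition_bytes: int,
    chunk_length_kb: int = 64,   # assumed chunk length
    bytes_per_offset: int = 8,   # assumed size of one stored offset
) -> int:
    """Approximate size of the chunk offsets covering one partition."""
    return chunk_count(partition_bytes, chunk_length_kb) * bytes_per_offset


if __name__ == "__main__":
    # Median and maximum partition sizes reported by cfhistograms above.
    for label, size in [("median", 8239), ("max", 44285675122)]:
        print(label, size, chunk_count(size), offset_index_bytes(size))
```

Under these assumptions the median partition fits in a single chunk, while the 44G partition spans roughly 676,000 chunks, which is consistent with the cost growing linearly with partition size.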

Let's use this task to continue the discussion about possible solutions.

GWicke renamed this task from Investigate huge max partition size in cfhistograms output to Solve wide row issues for frequently edited and re-rendered pages. Sep 27 2016, 3:58 PM
GWicke added a project: Services (next).

To expand on my summary in T94121#2010218, another factor that makes wide partitions problematic is Cassandra's inability to rule out SSTables based on range keys. In RESTBase terms, this affects any query for "latest" information, which means any request that does not explicitly supply all parts of the range key (ex: revision *and* tid).

The main summary information Cassandra has about keys per SSTable is a bloom filter on the partition key (title). For a page with many revisions, and when using a time or size based compaction strategy (anything but leveled), this bloom filter will indicate a match for most sstables. Without any additional information about key ranges, Cassandra can't rule out that any of these sstables contains a higher range key (revision or tid), so it needs to read each of those sstables to establish which one has the max. As described in T94121#2010218, this involves looking at compression metadata, and retrieving the data itself.

Things are a lot better for exact match queries (all parts of range key fixed). In this case, Cassandra can rule out other sstables based on timestamp ranges once it has found an exact match. Unfortunately, most of our use cases require range queries (max typically).
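
The difference between the two query types can be illustrated with a toy model. This is not how Cassandra is implemented: the SSTable class, its might_contain stand-in for the bloom filter (without false positives), and the "stop at the first exact match" rule are simplifications made up for this sketch. It only counts how many sstables have to be read for each kind of query.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SSTable:
    # title (partition key) -> revisions (range keys) stored in this sstable
    rows: Dict[str, List[int]]

    def might_contain(self, title: str) -> bool:
        """Stand-in for the per-sstable bloom filter on the partition key."""
        return title in self.rows


def latest_revision(sstables: List[SSTable], title: str) -> Tuple[Optional[int], int]:
    """Find the max revision; returns (revision, number of sstables read)."""
    best: Optional[int] = None
    reads = 0
    for table in sstables:
        if not table.might_contain(title):
            continue
        # Nothing tells us this sstable cannot hold a higher revision,
        # so every sstable containing the title has to be read.
        reads += 1
        newest_here = max(table.rows[title])
        if best is None or newest_here > best:
            best = newest_here
    return best, reads


def exact_revision(sstables: List[SSTable], title: str, revision: int) -> Tuple[bool, int]:
    """Look up one fully specified revision; returns (found, sstables read)."""
    reads = 0
    for table in sstables:  # ordered newest first
        if not table.might_contain(title):
            continue
        reads += 1
        if revision in table.rows[title]:
            return True, reads  # simplification: stop at the first exact match
    return False, reads


def build_example(num_sstables: int = 5, per_sstable: int = 100) -> List[SSTable]:
    """Newest-first sstables; 'hot_page' has revisions in every one of them."""
    tables = []
    for i in range(num_sstables):
        start = (num_sstables - 1 - i) * per_sstable
        tables.append(SSTable(rows={
            "hot_page": list(range(start, start + per_sstable)),
            f"quiet_page_{i}": [1],
        }))
    return tables


if __name__ == "__main__":
    tables = build_example()
    print("latest hot_page:", latest_revision(tables, "hot_page"))
    print("exact hot_page@499:", exact_revision(tables, "hot_page", 499))
    print("latest quiet_page_3:", latest_revision(tables, "quiet_page_3"))
```

In this model a "latest" query for the frequently edited page reads every sstable, while an exact lookup of a recent revision, or any query for a rarely edited page, reads only one.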

So, what can we do about this? Some options, some of which could be combined:

  1. Go back to leveled compaction. Leveled splits sstables by partition key within each level, which means that the number of sstables containing a given partition key is typically smaller than with any other compaction strategy. The cost is higher compaction load / write amplification, especially with large instances.
  2. Use smaller instances (more Cassandra instances, ScyllaDB). This reduces the number of sstables containing a given partition by reducing the total number of sstables. Can be combined with 1). In combination, leveled compaction might also become feasible again.
  3. Improve Cassandra by introducing some efficient range key summary. This summary would provide min & possibly max values per partition key, which in turn would let Cassandra rule out SSTables for range queries based on this information.
  4. Work around the issue by partitioning ranges by time or revision range.
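
As a minimal sketch of option 4, the partition key could be extended with a revision bucket, so that no single partition grows beyond a fixed number of revisions. Everything here is hypothetical: the bucket size of 1000, the in-memory BucketedStore, and the idea of tracking the newest bucket per title are illustrations, not an existing RESTBase or Cassandra feature. Revisions are treated as plain integers.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

BUCKET_SIZE = 1000  # hypothetical number of revisions per partition


def partition_key(title: str, revision: int, bucket_size: int = BUCKET_SIZE) -> Tuple[str, int]:
    """Compound partition key: the title plus the revision range bucket."""
    return (title, revision // bucket_size)


class BucketedStore:
    """In-memory stand-in for a table partitioned by (title, bucket)."""

    def __init__(self, bucket_size: int = BUCKET_SIZE) -> None:
        self.bucket_size = bucket_size
        self.partitions: Dict[Tuple[str, int], List[int]] = defaultdict(list)
        # A "latest" query has to know which bucket to ask; here that is
        # tracked per title, which a real design would need to store somewhere.
        self.newest_bucket: Dict[str, int] = {}

    def write(self, title: str, revision: int) -> None:
        key = partition_key(title, revision, self.bucket_size)
        self.partitions[key].append(revision)
        if key[1] > self.newest_bucket.get(title, -1):
            self.newest_bucket[title] = key[1]

    def latest(self, title: str) -> Optional[int]:
        """Max revision, reading only the newest partition for the title."""
        bucket = self.newest_bucket.get(title)
        if bucket is None:
            return None
        return max(self.partitions[(title, bucket)])

    def partition_widths(self, title: str) -> List[int]:
        """Number of revisions in each partition belonging to the title."""
        return [len(revs) for (t, _), revs in self.partitions.items() if t == title]


if __name__ == "__main__":
    store = BucketedStore()
    for rev in range(25000):
        store.write("hot_page", rev)
    widths = store.partition_widths("hot_page")
    print(len(widths), max(widths), store.latest("hot_page"))
```

The trade-off in this sketch is that "latest" queries need an extra piece of state (or a probe) to find the newest bucket, in exchange for bounded partition width.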
GWicke renamed this task from Solve wide row issues for frequently edited and re-rendered pages to Understand and solve wide row issues for frequently edited and re-rendered pages. Oct 12 2016, 8:50 PM