Secondary updates create hundreds of unnecessary writes per second
Closed, Resolved · Public

Authored By
GWicke, Jul 10 2015, 4:01 PM
Referenced Files
F191220: pasted_file (Jul 11 2015, 1:21 AM)
F191222: pasted_file (Jul 11 2015, 1:21 AM)
F191092: pasted_file (Jul 10 2015, 5:37 PM)
F191077: pasted_file (Jul 10 2015, 5:05 PM)

Description

While poking around in toppartitions output I noticed that we seem to rewrite an awful lot of renders per article. Some articles' renders are rewritten 180 times in 10 seconds. It seems that the index update code doesn't actually set an effective limit (possibly broken by the paging work?). It might also be nice to avoid rewriting renders that already have a TTL set.

In any case, this issue seems to be responsible for a good chunk of the write load and heap pressure we are seeing in production.

Sample from restbase1004:

nodetool -h restbase1004.eqiad.wmnet toppartitions local_group_wikipedia_T_parsoid_html data 20000
WRITES Sampler:
  Cardinality: ~2 (256 capacity)
  Top 10 partitions:
        Partition                                                                                       Count       +/-
        de.wikipedia.org:Benutzer\:Baumi                                                                  276         0
        fa.wikipedia.org:مدیاویکی\:Recentchangestext                                                      266         0
        fa.wikipedia.org:ویکیپدیا\:گزارش_دیتابیس/فهرست_مقالههای_پربیننده_از_۵۰۱_تا_۱۰۰۰/تاریخچه_۱       225         0
        fa.wikipedia.org:بحث_الگو\:سردر_تغییرات_اخیر                                                      181         0
        fa.wikipedia.org:الگو\:سردر_تغییرات_اخیر/سایر_ویکیها/افقی                                        176         0
        de.wikipedia.org:Benutzer\:Pixelfire                                                              176         0
        de.wikipedia.org:Benutzer\:Anachron                                                               163         0
        de.wikipedia.org:Benutzer\:Wahldresdner                                                           161         0
        de.wikipedia.org:Benutzer\:CeGe                                                                   151         0
        de.wikipedia.org:Benutzer\:Peter200                                                               151         0

READS Sampler:
  Cardinality: ~2 (256 capacity)
  Top 10 partitions:
        Partition                                                                                       Count       +/-
        de.wikipedia.org:Benutzer\:Baumi                                                                   14         0
        fa.wikipedia.org:مدیاویکی\:Recentchangestext                                                       13         0
        fa.wikipedia.org:بحث_الگو\:سردر_تغییرات_اخیر                                                       12         0
        de.wikipedia.org:Benutzer\:PeeCee                                                                  10         0
        fa.wikipedia.org:ویکیپدیا\:گزارش_دیتابیس/فهرست_مقالههای_پربیننده_از_۵۰۱_تا_۱۰۰۰/تاریخچه_۱        10         0
        de.wikipedia.org:Benutzer\:SchirmerPower/Statistik                                                  9         0
        de.wikipedia.org:Benutzer\:Anachron                                                                 9         0
        de.wikipedia.org:Benutzer\:Wahldresdner                                                             9         0
        de.wikipedia.org:Benutzer\:Peter200                                                                 8         0
        fa.wikipedia.org:الگو\:سردر_قهوهخانه/آزمایش                                                        8         0

Event Timeline

GWicke raised the priority of this task from to Needs Triage.
GWicke updated the task description.
GWicke subscribed.
GWicke triaged this task as Unbreak Now! priority. Jul 10 2015, 4:01 PM
GWicke set Security to None.

As a first step, we could apply a somewhat hacky fix to the limit handling by manually appending + ' limit ' + limit to the CQL string *iff* limit is a number:
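A minimal sketch of what that hotfix could look like, with illustrative names rather than the actual restbase-mod-table-cassandra code:

  // Append a LIMIT clause only when `limit` is a real, positive number,
  // so that a missing or malformed limit can never slip through as an
  // unbounded query.
  function withLimit(cql, limit) {
      if (typeof limit === 'number' && isFinite(limit) && limit > 0) {
          return cql + ' limit ' + limit;
      }
      return cql;
  }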

The effect of deploying the hotfix has been fairly dramatic:

  • read throughput dropped from >200 MB/s to less than 50 MB/s
  • write throughput dropped from about 50 MB/s to less than 10 MB/s
  • iowait dropped from about 10% to 0-1%
  • timeout errors in logstash disappeared
  • SSTables per read (p99) is heading in the right direction (should be at most 5 for HTML)

pasted_file (929 KB)

pasted_file (454 KB)

Longer term impact:

pasted_file (1 MB)

pasted_file (400 KB)

With the regained headroom we turned compaction throughput back up, which you can see as a slight increase in IO bandwidth and niced CPU cycles. Storage size, SSTables per read, and latencies are all shrinking, and there were no OOMs or other issues today.

GWicke lowered the priority of this task from Unbreak Now! to Medium. Jul 11 2015, 1:25 AM
GWicke added a subscriber: Eevans.

Lowering priority, as the main issue is resolved. The code change we deployed this morning was only a hotfix, however, and we'd like to clean this up more thoroughly. @Eevans is working on a solution that

a) cleans up limit handling without breaking paging, and
b) ideally avoids rewriting all but the replaced render by ignoring entries that already have a limited TTL set.

b) has the potential to further reduce write and compaction load from rerenders by about 50%.
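
To illustrate b): a rough sketch of the filtering step, assuming index entries are selected together with their remaining TTL (e.g. via CQL's TTL() function) and that the driver reports a missing TTL as null; names here are illustrative, not the actual implementation:

  // Only entries without a TTL still need the expiring rewrite; entries
  // that already carry a TTL are scheduled to expire and can be skipped.
  function entriesToRewrite(rows) {
      return rows.filter(function(row) {
          return row._ttl === null;
      });
  }

With such a filter in place, a rerender touches only the newly superseded entry instead of rewriting every stored render of the article.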

Based on this experience, we should also make sure that we carefully test async update behavior against possible write amplification issues.

Completed in https://github.com/wikimedia/restbase-mod-table-cassandra/pull/122.

Could we set up unit tests to make sure that we don't reintroduce unnecessary TTL rewrites? One option would be to stagger writes across several seconds and then verify that the resulting TTLs differ.
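
A sketch of what such a test could look like (mocha-style; putRender and getRenderTTLs are hypothetical helpers standing in for the real test harness):

  const assert = require('assert');

  describe('secondary index TTL handling', function() {
      it('does not refresh TTLs on unrelated renders', async function() {
          await putRender('Test_Page', 'tid-1');
          // Stagger the writes so the two TTL countdowns start apart.
          await new Promise(function(resolve) { setTimeout(resolve, 2000); });
          await putRender('Test_Page', 'tid-2');

          const ttls = await getRenderTTLs('Test_Page');
          // If the first entry had been rewritten along with the second,
          // both TTLs would have been refreshed to the same value.
          assert.notStrictEqual(ttls['tid-1'], ttls['tid-2']);
      });
  });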

Done in https://github.com/wikimedia/restbase-mod-table-spec/pull/4