Unchecked storage growth
Closed, Resolved, Public

Description

The new storage design is meant to persist a window of recent versions (relative to the time of a document's last change) in order to support concurrency in Visual Editor. Range tombstones are issued probabilistically to cull past versions outside of this window. The boundaries of this window are determined using an index that tracks when one document was superseded by another, and entries in this index are written with a TTL to prevent unbounded growth. There may be some flaw in this algorithm, however, or in its implementation in RESTBase, because after several months of use we continue to see linear growth in storage.

One theory put forward is that there may be a critical mass of documents with an edit frequency low enough to ensure no match is ever found in the index (because the records have expired), yet high enough to continually accumulate significant storage. If this is the case, it should be relatively straightforward to test by raising the index TTL and/or lowering the period of recency (currently 24 hours). This would probably not be a solution per se, however, because there would still exist an edit frequency at which we continue to "leak" revisions.


Cleanup Tracking

Pass 1  Pass 2  Table                                             Common name
                commons_T_mobile__ng_lead                         mobile_ng.lead
                commons_T_mobile__ng_remaining                    mobile_ng.remaining
                others_T_mobile__ng_lead                          mobile_ng.lead
                others_T_mobile__ng_remaining                     mobile_ng.remaining
                enwiki_T_parsoid__ng_section9wYKz1EvijuZtgPVP5tq  parsoid_ng.section-offsets
                others_T_parsoid__ng_section9wYKz1EvijuZtgPVP5tq  parsoid_ng.section-offsets
                commons_T_parsoid__ng_section9wYKz1EvijuZtgPVP5t  parsoid_ng.section-offsets
                wikipedia_T_parsoid__ng_section9wYKz1EvijuZtgPVP  parsoid_ng.section-offsets
                commons_T_parsoid__ng_data2pfX88WDfxDffCFHoTUDoy  parsoid_ng.data-parsoid
                others_T_parsoid__ng_data2pfX88WDfxDffCFHoTUDoy8  parsoid_ng.data-parsoid
                enwiki_T_mobile__ng_lead                          mobile_ng.lead
                enwiki_T_parsoid__ng_data2pfX88WDfxDffCFHoTUDoy8  parsoid_ng.data-parsoid
                others_T_parsoid__ng_html                         parsoid_ng.html
                commons_T_parsoid__ng_html                        parsoid_ng.html
                wikipedia_T_parsoid__ng_data2pfX88WDfxDffCFHoTUD  parsoid_ng.data-parsoid
                wikipedia_T_mobile__ng_lead                       mobile_ng.lead
                enwiki_T_parsoid__ng_html                         parsoid_ng.html
                enwiki_T_mobile__ng_remaining                     mobile_ng.remaining
                wikipedia_T_mobile__ng_remaining                  mobile_ng.remaining
                wikipedia_T_parsoid__ng_html                      parsoid_ng.html

Details

Related Gerrit Patches:
[mediawiki/services/restbase/deploy@master] Config: Set the delete probability to 100%
[mediawiki/services/restbase/deploy@master] Config: Increase delete probability to 50%

Event Timeline

Change 428576 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Config: Increase delete probability to 50%

https://gerrit.wikimedia.org/r/428576

Mentioned in SAL (#wikimedia-operations) [2018-04-24T09:03:51Z] <mobrovac@tin> Started deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points - T192689 T190846

Mentioned in SAL (#wikimedia-operations) [2018-04-24T09:17:04Z] <mobrovac@tin> Finished deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points - T192689 T190846 (duration: 13m 13s)

Mentioned in SAL (#wikimedia-operations) [2018-04-24T09:18:07Z] <mobrovac@tin> Started deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points, take #2 - T192689 T190846

Mentioned in SAL (#wikimedia-operations) [2018-04-24T09:21:09Z] <mobrovac@tin> Finished deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points, take #2 - T192689 T190846 (duration: 03m 03s)

Mentioned in SAL (#wikimedia-operations) [2018-04-24T09:21:23Z] <mobrovac@tin> Started deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points, take #3 - T192689 T190846

Mentioned in SAL (#wikimedia-operations) [2018-04-24T09:25:49Z] <mobrovac@tin> Finished deploy [restbase/deploy@1661f69]: Increase the deletion probability to 50% and expose the CSS end points, take #3 - T192689 T190846 (duration: 04m 30s)

Change 428679 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/restbase/deploy@master] Config: Set the delete probability to 100%

https://gerrit.wikimedia.org/r/428679

Change 428679 merged by Mobrovac:
[mediawiki/services/restbase/deploy@master] Config: Set the delete probability to 100%

https://gerrit.wikimedia.org/r/428679

Mentioned in SAL (#wikimedia-operations) [2018-04-24T16:40:31Z] <mobrovac@tin> Started deploy [restbase/deploy@fbce520]: Set the delete probability to 100% - T192689

Mentioned in SAL (#wikimedia-operations) [2018-04-24T16:52:10Z] <mobrovac@tin> Finished deploy [restbase/deploy@fbce520]: Set the delete probability to 100% - T192689 (duration: 11m 40s)

Mentioned in SAL (#wikimedia-operations) [2018-04-24T16:55:50Z] <mobrovac@tin> Started deploy [restbase/deploy@fbce520]: Set the delete probability to 100% - T192689

Mentioned in SAL (#wikimedia-operations) [2018-04-24T17:01:18Z] <mobrovac@tin> Finished deploy [restbase/deploy@fbce520]: Set the delete probability to 100% - T192689 (duration: 05m 27s)

Mentioned in SAL (#wikimedia-operations) [2018-04-24T17:02:07Z] <mobrovac@tin> Started deploy [restbase/deploy@fbce520]: Set the delete probability to 100% [deploy to restbase1010] - T192689

Mentioned in SAL (#wikimedia-operations) [2018-04-24T17:04:11Z] <mobrovac@tin> Finished deploy [restbase/deploy@fbce520]: Set the delete probability to 100% [deploy to restbase1010] - T192689 (duration: 02m 04s)

One way to evaluate the extent to which this retention algorithm is able to "leak" revisions would be to sample the dataset, bucketing on the distance between revisions and renders (age). This could be useful in determining the extent of the problem, and perhaps in determining a reasonable timeline TTL that would close the loop in all but a statistically insignificant number of cases.

NOTE: Simply scanning a table like wikipedia_T_parsoid__ng_html.data from cqlsh makes it pretty obvious that we are not looking for needles in a haystack, but instead sorting through an enormous pile of needles.

In sampling records (manually), I encountered none where a timeline/index TTL of 3 months would not have been enough to close the hole here. There are no doubt some where 3 months would not be enough, but it seems clear that retention for the overwhelming majority would function as expected. Since even a TTL of 6 months would only ever result in an 8GB timeline for our worst case, it seems safe enough to double the back-of-napkin 3-month value, and set the default TTL property to 15552000 seconds.

Here are the ALTERs I propose:

commons_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLds_data:
  statement: |
    ALTER TABLE "commons_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLds".data WITH default_time_to_live = 15552000

commons_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w_data:
  statement: |
    ALTER TABLE "commons_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w".data WITH default_time_to_live = 15552000

commons_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP04_data:
  statement: |
    ALTER TABLE "commons_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP04".data WITH default_time_to_live = 15552000

commons_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWSp0_data:
  statement: |
    ALTER TABLE "commons_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWSp0".data WITH default_time_to_live = 15552000

enwiki_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLds_data:
  statement: |
    ALTER TABLE "enwiki_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLds".data WITH default_time_to_live = 15552000

enwiki_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w_data:
  statement: |
    ALTER TABLE "enwiki_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w".data WITH default_time_to_live = 15552000

enwiki_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP04_data:
  statement: |
    ALTER TABLE "enwiki_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP04".data WITH default_time_to_live = 15552000

enwiki_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWSp0_data:
  statement: |
    ALTER TABLE "enwiki_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWSp0".data WITH default_time_to_live = 15552000

others_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLds_data:
  statement: |
    ALTER TABLE "others_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLds".data WITH default_time_to_live = 15552000

others_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w_data:
  statement: |
    ALTER TABLE "others_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_w".data WITH default_time_to_live = 15552000

others_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP04_data:
  statement: |
    ALTER TABLE "others_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP04".data WITH default_time_to_live = 15552000

others_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWSp0_data:
  statement: |
    ALTER TABLE "others_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWSp0".data WITH default_time_to_live = 15552000

wikipedia_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLd_data:
  statement: |
    ALTER TABLE "wikipedia_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLd".data WITH default_time_to_live = 15552000

wikipedia_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD__data:
  statement: |
    ALTER TABLE "wikipedia_T_mobile__ngR6XB1sh6_FFo_mfX4oZA56vpD_".data WITH default_time_to_live = 15552000

wikipedia_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP_data:
  statement: |
    ALTER TABLE "wikipedia_T_parsoid__ng_4iO7iu2BPZmvx5_7kxNNNqVP".data WITH default_time_to_live = 15552000

wikipedia_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWS_data:
  statement: |
    ALTER TABLE "wikipedia_T_parsoid__ngwEOh2HlqL6lr8QbNVKv1eSoWS".data WITH default_time_to_live = 15552000

NOTE: I believe this is correct, but help in reviewing would be appreciated (including, but not limited to, whether this is actually the correct list of keyspaces).
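As an aid to reviewing the list, here is a minimal sketch of how the TTL value and one of the statements above could be regenerated and compared. The helper names (`six_month_ttl`, `alter_statement`) are hypothetical, and a "month" is assumed to be 30 days, which is what 15552000 seconds works out to.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def six_month_ttl(days_per_month=30):
    """Return six months in seconds, assuming 30-day months (an assumption)."""
    return 6 * days_per_month * SECONDS_PER_DAY


def alter_statement(keyspace, ttl_seconds):
    """Render the ALTER for the `data` table of one keyspace.

    The statement shape is copied from the proposal above; the keyspace is
    quoted because the names are mixed-case.
    """
    return (
        f'ALTER TABLE "{keyspace}".data '
        f"WITH default_time_to_live = {ttl_seconds}"
    )


if __name__ == "__main__":
    ttl = six_month_ttl()
    print(ttl)
    print(alter_statement("commons_T_mobile__ng3HeqOXXmkYfPizz4RPUR4OLXLds", ttl))
```

Generating the statements from a list of keyspace names and diffing the result against the proposal is one cheap way to catch a mistyped name or TTL.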

Mentioned in SAL (#wikimedia-operations) [2018-04-25T19:12:44Z] <urandom> altering timeline tables for 6 month TTL -- T192689

This is complete; the timeline table TTLs now default to 6 months.

Some concrete numbers showing the distance between versions (in time). Based on these numbers, our new index TTL (6 months) seems more than generous enough.

Weeks  Count    Percent
0      1816754  33.73
1      831260   15.43
2      558837   10.37
3      457782   8.50
4      387359   7.19
5      326066   6.05
6      238410   4.43
7      169098   3.14
8      105567   1.96
9      82829    1.54
10     81107    1.51
11     107504   2.00
12     77884    1.45
13     44463    0.83
14     29255    0.54
15     28892    0.54
16     27349    0.51
17     9216     0.17
18     5052     0.09
19     2036     0.04
20     5        0.00

Interesting! So, based on these numbers, 50% of the data we are currently storing ought to have been cleaned up.

Mentioned in SAL (#wikimedia-operations) [2018-05-01T15:50:12Z] <urandom> restbase: begin culling leaked revisions, others_T_mobile__ng_lead -- T192689

Mentioned in SAL (#wikimedia-operations) [2018-05-01T19:49:18Z] <urandom> restbase: begin culling leaked revisions, others_T_mobile__ng_remaining - T192689

Eevans updated the task description. May 1 2018, 8:12 PM
Eevans updated the task description. May 1 2018, 8:16 PM
Eevans renamed this task from Unchecked storage growth(?) to Unchecked storage growth. May 1 2018, 8:20 PM

Mentioned in SAL (#wikimedia-operations) [2018-05-01T20:28:50Z] <urandom> restbase: begin culling leaked revisions, commons_T_mobile__ng_lead - T192689

Eevans updated the task description. May 1 2018, 8:30 PM

Mentioned in SAL (#wikimedia-operations) [2018-05-01T20:43:57Z] <urandom> restbase: begin culling leaked revisions, commons_T_mobile__ng_remaining - T192689

Eevans updated the task description. May 1 2018, 8:49 PM
Eevans updated the task description. May 1 2018, 8:57 PM
Eevans updated the task description. May 1 2018, 9:15 PM

Mentioned in SAL (#wikimedia-operations) [2018-05-01T21:23:55Z] <urandom> restbase: begin culling leaked revisions, enwiki_T_mobile__ng_{lead,remaining} - T192689

Eevans added a comment. May 1 2018, 9:39 PM

The 0 bucket covers everything up to 7 days, the 1 bucket everything from 7 days and 1 second up to 14 days, so figure that 4 out of the 7 days represented there fall outside the 10 day TTL we used to have.
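The bucket reasoning above can be turned into a rough estimate. The sketch below is illustrative only: it copies the counts from the table of version distances earlier in this thread, and it assumes that versions are spread uniformly across the days of a weekly bucket, which the data does not confirm. The function name `fraction_beyond_ttl` is hypothetical.

```python
# (week bucket, count) pairs copied from the version-distance table earlier
# in this thread. Bucket N covers distances from 7*N up to 7*(N+1) days.
BUCKETS = [
    (0, 1816754), (1, 831260), (2, 558837), (3, 457782), (4, 387359),
    (5, 326066), (6, 238410), (7, 169098), (8, 105567), (9, 82829),
    (10, 81107), (11, 107504), (12, 77884), (13, 44463), (14, 29255),
    (15, 28892), (16, 27349), (17, 9216), (18, 5052), (19, 2036), (20, 5),
]


def fraction_beyond_ttl(buckets, ttl_days):
    """Estimate the share of versions whose distance exceeds `ttl_days`.

    Assumes a uniform spread inside each weekly bucket, so a bucket that
    straddles the TTL contributes only its proportional share.
    """
    total = sum(count for _, count in buckets)
    beyond = 0.0
    for week, count in buckets:
        start, end = week * 7, (week + 1) * 7
        if ttl_days <= start:
            beyond += count  # whole bucket lies past the TTL
        elif ttl_days < end:
            beyond += count * (end - ttl_days) / 7  # partial bucket
    return beyond / total


if __name__ == "__main__":
    print(f"{fraction_beyond_ttl(BUCKETS, 10):.1%} beyond a 10 day TTL")
    print(f"{fraction_beyond_ttl(BUCKETS, 180):.1%} beyond a 180 day TTL")
```

Under that uniformity assumption the old 10 day TTL would have missed roughly 60% of the sampled versions, while none of the sampled distances reach 180 days.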

Eevans updated the task description. May 1 2018, 9:40 PM

Mentioned in SAL (#wikimedia-operations) [2018-05-08T16:19:06Z] <urandom> force (split) compaction of wikipedia_T_mobile__ng_lead.data, restbase1016 - T192689

Eevans added a comment. Edited May 9 2018, 2:52 PM

Culling of enwiki_T_mobile__{lead,remaining} in production has been on-going for a week now. All indications are that it is working, but progress is slow (that we are still working on the same two keyspaces speaks to that). We've avoided starting any additional invocations of the cleanup job out of a desire to not impose too much load on the cluster, but with utilization still steadily increasing, we might consider stepping things up a bit.

I propose the following:

Near-term

  • We continue culling enwiki_T_mobile__{lead,remaining} as-is
  • We begin concurrent cleanup tasks of other keyspaces (perhaps starting w/ wikipedia_T_mobile__{lead,remaining}?), and experiment with increasing task concurrency (the concurrency of DELETEs)

Latest version of the cleanup code: https://git.io/vpP6q

Longer-term

I've prototyped some Java code that works almost identically to the cleanup code above, but instead of issuing discrete DELETE statements, it creates SSTable files that can be bulk-imported using sstableloader. An import like this would stream these tombstone-only SSTables directly to the applicable nodes, where compaction would fold them into the existing dataset. If this works, it should be much more efficient, and could accelerate the process immensely. I will test this in the Dev environment and, if it checks out, make the code robust enough to use in production.

Mentioned in SAL (#wikimedia-operations) [2018-05-09T15:56:08Z] <urandom> starting revision cleanup job, wikipedia_T_mobile__ng_lead keyspace - T192689

Eevans updated the task description. May 11 2018, 1:55 AM

Yesterday evening the enwiki_T_mobile__ng_lead and enwiki_T_mobile__ng_remaining cleanup tasks ran to completion (after a bit more than a week). As an experiment, I restarted enwiki_T_mobile__ng_lead this morning (at the beginning of the token-space), and let it run for about an hour.

Partitions scanned: 6635152
Rows processed: 8296924
Deletes issued: 833755
Page state: 3d0010656e2e77696b6970656469612e6f7267000027436976696c5f5761725f4d656d6f7269616c5f28537963616d6f72652c5f496c6c696e6f6973290017000430b94263107cf95665428111e8a8dd88a5b2d3e578f07f815a1ff07ffffffe

Weeks  Count    Percent
0      1184831  71.30
1      70766    4.26
2      47633    2.87
3      38239    2.30
4      42055    2.53
5      47101    2.83
6      92848    5.59
7      33469    2.01
8      16774    1.01
9      16039    0.97
10     4968     0.30
11     4444     0.27
12     2765     0.17
13     4201     0.25
14     4133     0.25
15     8570     0.52
16     9823     0.59
17     5158     0.31
18     1779     0.11
19     1073     0.06
20     1460     0.09
21     1308     0.08
22     505      0.03
23     362      0.02
24     2279     0.14
25     2701     0.16
26     606      0.04
27     3012     0.18
28     5227     0.31
29     2142     0.13
30     2280     0.14
31     2826     0.17
32     291      0.02
33     103      0.01

NOTE: Compare these age distributions to https://phabricator.wikimedia.org/T192689#4170031

And here it is after re-running it a short while later:

Partitions scanned: 6394915
Rows processed: 7280001
Deletes issued: 6260
Page state: 320010656e2e77696b6970656469612e6f726700001c557365725f74616c6b3a4472696c426f742f456469746e6f746963650017000411d2f3e610c4f3a7f1550411e89fd5df3863b3e628f07f90d6f7f07ffffffe

Weeks  Count    Percent
0      883094   99.78
1      516      0.06
2      334      0.04
3      86       0.01
4      138      0.02
5      120      0.01
6      73       0.01
7      347      0.04
8      176      0.02
9      16       0.00
10     17       0.00
11     24       0.00
12     6        0.00
13     6        0.00
14     1        0.00
15     13       0.00
16     16       0.00
17     33       0.00
18     6        0.00
19     1        0.00
20     0        0.00
21     46       0.01
22     0        0.00
23     0        0.00
24     0        0.00
25     1        0.00
26     0        0.00
27     0        0.00
28     13       0.00
29     2        0.00

Eevans updated the task description. May 19 2018, 12:48 PM

Using the SSTable-generation/bulk-load approach I processed parsoid html, data-parsoid, and section-offsets for group commons. It took ~3.5 hours to generate the tables for all 3 concurrently (~57M partitions, ~193M rows, and ~65M deletes, each). Other than a small uptick in iowait on the node generating the table, impact on the cluster was almost imperceptible. The bulk-load process took ~15mins for each, and other than the associated network throughput, no impact on cluster performance was observed.

This approach seems very promising.

Eevans updated the task description. May 19 2018, 4:38 PM
Eevans updated the task description. May 19 2018, 4:46 PM
Eevans updated the task description. May 20 2018, 8:07 AM
Eevans updated the task description. May 20 2018, 12:28 PM
Eevans updated the task description. May 31 2018, 6:19 PM
Eevans updated the task description. May 31 2018, 6:26 PM
Eevans updated the task description. May 31 2018, 6:52 PM
Eevans updated the task description.
Eevans updated the task description. May 31 2018, 7:08 PM
Eevans updated the task description. May 31 2018, 7:21 PM
Eevans updated the task description. May 31 2018, 7:32 PM
Eevans updated the task description.
Eevans updated the task description. May 31 2018, 7:44 PM
Eevans updated the task description. May 31 2018, 7:52 PM
Eevans updated the task description. May 31 2018, 8:28 PM
Eevans updated the task description. May 31 2018, 8:34 PM
Eevans updated the task description.
Eevans updated the task description. May 31 2018, 8:36 PM
Eevans updated the task description. May 31 2018, 8:42 PM
Eevans updated the task description. May 31 2018, 9:40 PM
Eevans updated the task description.
Eevans updated the task description. May 31 2018, 9:58 PM
Eevans updated the task description. May 31 2018, 10:26 PM
Eevans updated the task description. May 31 2018, 10:35 PM
Eevans updated the task description. May 31 2018, 11:27 PM
Eevans updated the task description. Jun 1 2018, 1:57 AM
Eevans updated the task description. Jun 1 2018, 1:01 PM
Eevans updated the task description. Jun 1 2018, 2:58 PM
Eevans updated the task description. Jun 1 2018, 8:55 PM
Eevans added a comment. Edited Jun 2 2018, 1:43 AM

After a second pass of cleanups, a full set of major compactions on restbase1016.eqiad.wmnet yields the following utilization:

restbase1016
Filesystem      Size  Used Avail Use% Mounted on
udev             10M     0   10M   0% /dev
tmpfs            26G  2.4G   23G  10% /run
/dev/md0         28G  3.8G   23G  15% /
tmpfs            63G     0   63G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/sda4       1.4T  545G  876G  39% /srv/sda4
/dev/sdd4       1.4T  518G  903G  37% /srv/sdd4
/dev/sdb4       1.4T  648G  773G  46% /srv/sdb4
/dev/sdc4       1.4T  746G  676G  53% /srv/sdc4
/dev/md2         37G   25G   11G  70% /srv/cassandra/instance-data

restbase1007.eqiad.wmnet:

restbase1007
Filesystem      Size  Used Avail Use% Mounted on
udev             11M     0   11M   0% /dev
tmpfs            28G  2.7G   25G  10% /run
/dev/md0         30G  4.8G   24G  18% /
tmpfs            68G     0   68G   0% /dev/shm
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs            68G     0   68G   0% /sys/fs/cgroup
/dev/sdd4       958G  584G  375G  61% /srv/sdd4
/dev/sdb4       958G  567G  392G  60% /srv/sdb4
/dev/sdc4       958G  631G  328G  66% /srv/sdc4
/dev/sda4       958G  395G  564G  42% /srv/sda4
/dev/sde4       958G  299G  660G  32% /srv/sde4
/dev/md2         50G   26G   21G  56% /srv/cassandra/instance-data

Eevans moved this task from Backlog to In-Progress on the User-Eevans board. Jun 13 2018, 7:33 PM
Eevans lowered the priority of this task from High to Medium. Jun 26 2018, 4:12 PM

Storage utilization continues (again) to steadily grow.

[Graph: storage utilization, last 90 days (eqiad)]

It's still not clear to me whether there is still some issue algorithm-wise, or if this is simply the result of a need to adequately populate the indices after updating the TTL. However, since it is difficult to reason about how long we should expect to wait before seeing utilization stabilize, and since utilization is approaching 80% in places, I propose the following:

  1. We extend the code previously used to cull leaked records to also correspondingly update the applicable indices
  2. Make another pass over all keyspaces, and import the resulting SSTables (deletes + index writes)
  3. Perform major compactions cluster-wide

Eevans raised the priority of this task from Medium to High. Jul 31 2018, 6:45 PM

Some keyspaces are above the threshold for single SSTable compaction (default: 20%). As an experiment, let's try enabling unchecked_tombstone_compaction on a canary (enwiki_T_mobile__ng_remaining).

Estimated droppable tombstones (enwiki_T_mobile__ng_remaining)
$ cdsh -c restbase -d codfw -- "grep -vE \"0\.0\" reports/droppable_tombstones_2018-07-31.csv |grep enwiki_T_mobile__ng_remaining |awk -F \",\" '{ print \$4 \",\" \$1 \",\" \$2; }' | sort -n && echo \"~~~~~~\""
restbase2003.codfw.wmnet: 0.293,restbase2003-b,enwiki_T_mobile__ng_remaining
restbase2003.codfw.wmnet: 0.345,restbase2003-c,enwiki_T_mobile__ng_remaining
restbase2003.codfw.wmnet: 0.373,restbase2003-a,enwiki_T_mobile__ng_remaining
restbase2003.codfw.wmnet: ~~~~~~
restbase2004.codfw.wmnet: 0.272,restbase2004-c,enwiki_T_mobile__ng_remaining
restbase2004.codfw.wmnet: 0.305,restbase2004-b,enwiki_T_mobile__ng_remaining
restbase2004.codfw.wmnet: 0.381,restbase2004-a,enwiki_T_mobile__ng_remaining
restbase2004.codfw.wmnet: ~~~~~~
restbase2008.codfw.wmnet: 0.335,restbase2008-b,enwiki_T_mobile__ng_remaining
restbase2008.codfw.wmnet: 0.355,restbase2008-c,enwiki_T_mobile__ng_remaining
restbase2008.codfw.wmnet: 0.387,restbase2008-a,enwiki_T_mobile__ng_remaining
restbase2008.codfw.wmnet: ~~~~~~
restbase2011.codfw.wmnet: 0.291,restbase2011-a,enwiki_T_mobile__ng_remaining
restbase2011.codfw.wmnet: 0.332,restbase2011-b,enwiki_T_mobile__ng_remaining
restbase2011.codfw.wmnet: 0.342,restbase2011-c,enwiki_T_mobile__ng_remaining
restbase2011.codfw.wmnet: ~~~~~~
restbase2001.codfw.wmnet: 0.317,restbase2001-a,enwiki_T_mobile__ng_remaining
restbase2001.codfw.wmnet: 0.317,restbase2001-c,enwiki_T_mobile__ng_remaining
restbase2001.codfw.wmnet: 0.327,restbase2001-b,enwiki_T_mobile__ng_remaining
restbase2001.codfw.wmnet: ~~~~~~
restbase2002.codfw.wmnet: 0.320,restbase2002-c,enwiki_T_mobile__ng_remaining
restbase2002.codfw.wmnet: 0.322,restbase2002-b,enwiki_T_mobile__ng_remaining
restbase2002.codfw.wmnet: 0.340,restbase2002-a,enwiki_T_mobile__ng_remaining
restbase2002.codfw.wmnet: ~~~~~~
restbase2007.codfw.wmnet: 0.314,restbase2007-c,enwiki_T_mobile__ng_remaining
restbase2007.codfw.wmnet: 0.339,restbase2007-a,enwiki_T_mobile__ng_remaining
restbase2007.codfw.wmnet: 0.427,restbase2007-b,enwiki_T_mobile__ng_remaining
restbase2007.codfw.wmnet: ~~~~~~
restbase2010.codfw.wmnet: 0.319,restbase2010-b,enwiki_T_mobile__ng_remaining
restbase2010.codfw.wmnet: 0.345,restbase2010-c,enwiki_T_mobile__ng_remaining
restbase2010.codfw.wmnet: 0.360,restbase2010-a,enwiki_T_mobile__ng_remaining
restbase2010.codfw.wmnet: ~~~~~~
restbase2005.codfw.wmnet: 0.288,restbase2005-c,enwiki_T_mobile__ng_remaining
restbase2005.codfw.wmnet: 0.350,restbase2005-a,enwiki_T_mobile__ng_remaining
restbase2005.codfw.wmnet: 0.405,restbase2005-b,enwiki_T_mobile__ng_remaining
restbase2005.codfw.wmnet: ~~~~~~
restbase2006.codfw.wmnet: 0.272,restbase2006-a,enwiki_T_mobile__ng_remaining
restbase2006.codfw.wmnet: 0.272,restbase2006-c,enwiki_T_mobile__ng_remaining
restbase2006.codfw.wmnet: 0.389,restbase2006-b,enwiki_T_mobile__ng_remaining
restbase2006.codfw.wmnet: ~~~~~~
restbase2009.codfw.wmnet: 0.306,restbase2009-b,enwiki_T_mobile__ng_remaining
restbase2009.codfw.wmnet: 0.314,restbase2009-a,enwiki_T_mobile__ng_remaining
restbase2009.codfw.wmnet: 0.362,restbase2009-c,enwiki_T_mobile__ng_remaining
restbase2009.codfw.wmnet: ~~~~~~
restbase2012.codfw.wmnet: 0.338,restbase2012-a,enwiki_T_mobile__ng_remaining
restbase2012.codfw.wmnet: 0.369,restbase2012-c,enwiki_T_mobile__ng_remaining
restbase2012.codfw.wmnet: 0.370,restbase2012-b,enwiki_T_mobile__ng_remaining
restbase2012.codfw.wmnet: ~~~~~~
$
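For comparing these estimates against the 20% threshold programmatically, the output above can be parsed with a few lines of Python. This is only a sketch: it assumes the `host: ratio,instance,keyspace` line shape seen in the output, and the names `Estimate`, `parse_line` and `above_threshold` are hypothetical.

```python
from typing import NamedTuple, Optional


class Estimate(NamedTuple):
    host: str
    ratio: float
    instance: str
    keyspace: str


def parse_line(line: str) -> Optional[Estimate]:
    """Parse one 'host: ratio,instance,keyspace' line.

    Returns None for separator lines such as 'host: ~~~~~~' and for
    anything else that does not match the assumed shape.
    """
    host, _, rest = line.partition(": ")
    fields = rest.strip().split(",")
    if len(fields) != 3:
        return None
    try:
        ratio = float(fields[0])
    except ValueError:
        return None
    return Estimate(host, ratio, fields[1], fields[2])


def above_threshold(lines, threshold=0.2):
    """Return the estimates whose droppable-tombstone ratio exceeds threshold."""
    estimates = (parse_line(line) for line in lines)
    return [e for e in estimates if e is not None and e.ratio > threshold]


# Three lines copied from the output above, used as sample input.
SAMPLE = [
    "restbase2003.codfw.wmnet: 0.293,restbase2003-b,enwiki_T_mobile__ng_remaining",
    "restbase2003.codfw.wmnet: 0.345,restbase2003-c,enwiki_T_mobile__ng_remaining",
    "restbase2003.codfw.wmnet: ~~~~~~",
]

if __name__ == "__main__":
    for estimate in above_threshold(SAMPLE):
        print(estimate.instance, estimate.ratio)
```

Every instance listed in the output above exceeds the 0.2 default, which is the observation motivating the canary experiment.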

Eevans renamed this task from Unchecked storage growth to Unchecked storage growth(?). Jul 31 2018, 6:58 PM

Mentioned in SAL (#wikimedia-operations) [2018-07-31T19:00:41Z] <urandom> enabling unchecked_tombstone_compaction on enwiki_T_mobile__ng_remaining -- T192689

Ok.

The storage algorithm only evaluates renders for deletion when new renders are stored, and only those that were replaced by another render at least TTL ago are candidates for deletion. Likewise, revisions are only evaluated for deletion when a new revision is stored, and only if the corresponding renders were superseded TTL or more in the past. This means that in a perfect world, there will always be at least two renders stored, but possibly (many) more.

For example: Let's assume a TTL of 24 hours (86400 seconds). Imagine render 0 of revision A is stored for a new title.

revision  render  timestamp
A         0       2018-07-01T00:00:00

Subsequently, render 1 of revision A is stored. Render 1 supersedes render 0, making render 0 a candidate for deletion TTL seconds after render 1 is written, but only once a new render is stored.

revision  render  timestamp
A         0       2018-07-01T00:00:00
A         1       2018-07-03T00:00:00

Finally, render 2 is stored, and if TTL seconds or more have elapsed since the writing of render 1, then render 0 can be deleted.

revision  render  timestamp
A         0       2018-07-01T00:00:00
A         1       2018-07-03T00:00:00
A         2       2018-07-05T00:00:00

Again, this is the best-case scenario. In actuality, the same pattern holds true for revisions, and when combined with sub-TTL edits and/or re-renders, the number of records persisted at any one time can be significant. Of course, over-stored records of this nature continue to be candidates for deletion, but only upon future writes (and imagine a scenario where a flurry of edits for a title occurs within the span of 24 hours, followed by a period of relative quiet lasting weeks or months).
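The render example and the flurry scenario can be replayed with a small simulation. This is a simplified sketch, not the RESTBase implementation: it models renders of a single revision only, assumes deletes always fire (a delete probability of 100%), and ignores index TTL expiry. The function names `store_render` and `replay` are hypothetical.

```python
from datetime import datetime, timedelta

TTL = timedelta(hours=24)  # the period of recency used in the example


def store_render(stored, render, written_at, ttl=TTL):
    """Store a render, then cull renders superseded at least `ttl` ago.

    `stored` is a list of (render, written_at) tuples in write order. A
    render is superseded at the moment the next render is written. Culling
    only ever happens here, as a side effect of a write.
    """
    candidates = stored + [(render, written_at)]
    kept = []
    for i, entry in enumerate(candidates):
        if i == len(candidates) - 1:
            kept.append(entry)  # the newest render is always kept
            continue
        superseded_at = candidates[i + 1][1]
        if written_at - superseded_at < ttl:
            kept.append(entry)  # superseded too recently to delete
    return kept


def replay(writes, ttl=TTL):
    """Apply a sequence of (render, written_at) writes; return stored renders."""
    stored = []
    for render, written_at in writes:
        stored = store_render(stored, render, written_at, ttl)
    return [render for render, _ in stored]


# The three writes from the example: render 0 is culled when render 2 arrives.
example = [
    (0, datetime(2018, 7, 1)),
    (1, datetime(2018, 7, 3)),
    (2, datetime(2018, 7, 5)),
]

# Five renders within five hours: nothing is old enough to cull.
flurry = [(n, datetime(2018, 7, 1) + timedelta(hours=n)) for n in range(5)]

if __name__ == "__main__":
    print(replay(example))  # [1, 2]
    print(replay(flurry))   # all five renders remain
```

In this model the flurry keeps all five renders until the next write, however long that takes; one more write a month later brings the count back down to two.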

This is distinctly different from the "leakage" we experienced when Cassandra TTLs on the indices were set too low; this overstorage is a property of the system (even if undesirable), one that we have been overlooking, and one that is difficult to quantify because it relates to the distribution and frequency of edits and re-renders, not to mention document size.


Now, some months ago we performed a round of culling intended to clean up after the aforementioned leakage. This process reduced everything to a single record. This put the dataset well below what we can expect storage to normalize at (remember: it could never be less than 2x of this, and will certainly be much more). Ergo, in hindsight, the steadily rising storage utilization should be entirely expected (even discounting the aforementioned "leakage").

And of course, all of this means that the numbers used in capacity planning are wrong (by a lot). The real question, of course, is: what should we expect for overstorage? What multiplier should be used? I'm not (yet) sure how to establish this with any precision.

Eevans renamed this task from Unchecked storage growth(?) to Unchecked storage growth. Aug 1 2018, 6:47 PM
Eevans closed this task as Resolved. Aug 1 2018, 7:17 PM
Eevans claimed this task.

Since confidence is high at this point that the increased TTL has solved the original issue identified here, I will close this issue; I've opened T200937: RESTBase k-r-v storage overcommit to address the overcommit.