
Consider what mechanism to use to make date deletion possible
Closed, Resolved (Public)

Description

In conversations with @JAllemandou, we have learned that keeping an API like the Pageviews API running over the course of many years has taught us some lessons about data retention. Mainly, keeping the data available forever has caused Cassandra performance issues. Attempting a data deletion now, on a database of that size, is impractical, and TTLs are applied per record at write time, so they cannot retroactively expire data that has already been loaded.

Thus we should consider setting TTLs for all tables of Commons Impact Metrics before we launch.
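For reference, a table-level default TTL is set via the default_time_to_live table property (in seconds); it can be changed later with ALTER TABLE, but since TTLs are stamped onto records at write time, a change only affects data written afterwards. A minimal sketch, with a hypothetical table and a hypothetical 5-year duration:

-- Hypothetical sketch: table-level default TTL of 5 years (157680000 seconds).
-- The TTL is applied to each record at write time, so changing it later does
-- not retroactively expire rows that are already stored.
CREATE TABLE commons.example_with_default_ttl (
    entity  VARCHAR,
    dt      TIMESTAMP,
    value   BIGINT,
    PRIMARY KEY ((entity), dt)
) WITH default_time_to_live = 157680000;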

In this ticket we should:

  • Wait until we have a final allow-list. Do a full run of all the ETL, all the way to Cassandra, and see how many rows and how much data we get for a couple of backfilled months.
  • Extrapolate what the size in Cassandra would be for all our 14 tables over 1, 3, 5 years.
  • With this info, consider whether we'd need to set TTLs, and if so, at what duration.
  • If no TTL, consider what to do long term with the data: should we delete it? If so, when should that happen? After 1, 3, 5 years? Long term we want T366631.

Details

Event Timeline

@JAllemandou could you elaborate on the performance issues?

One thing worth considering: since TTLs are recorded at write time, on a per-record basis, changing your retention policy later becomes very costly (probably prohibitively so).

An alternative: Everything in the post-Cassandra 3.x era supports range tombstones, meaning a DELETE FROM ... WHERE val < ? is possible, and very efficient. One tiny write, and Cassandra will (eventually) GC everything that is less-than, as part of normal compaction. This should work for any time-series datasets, like pageviews, and several of the Commons Impact Metrics tables as they exist today (commons.category_metrics_snapshot for example). Other CIM tables (like commons.top_viewed_categories_monthly) could probably be re-thought to make this possible as well.
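To make that concrete, here is a sketch in CQL (the table and values are hypothetical, just a generic time-series layout): the time component lives in the clustering key, and a single statement writes one range tombstone for the targeted partition:

-- Hypothetical time-series layout: partition on the entity, cluster on time.
CREATE TABLE commons.example_metrics_by_day (
    entity  VARCHAR,
    day     DATE,
    value   BIGINT,
    PRIMARY KEY ((entity), day)
);

-- One cheap range tombstone; the partition key still has to be specified,
-- so a retention job would issue one such statement per partition.
DELETE FROM commons.example_metrics_by_day
 WHERE entity = 'some-entity'
   AND day < '2020-01-01';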

This should work for any time-series datasets
...
Other CIM tables (like commons.top_viewed_categories_monthly) could probably be re-thought to make this possible as well.

All top_* tables have rows that belong to a particular year and month. Example:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

So wouldn't the tombstone thingy work similarly? As in:

DELETE FROM top_viewed_categories_monthly WHERE year < 2020

Semantically, that DELETE is ok with our data model.

No, because year & month are a part of the partition key; the category_scope, wiki, year, and month attributes are for all intents and purposes concatenated together. A range delete like this is something that has to be done against the (indexed) partition itself.

To expand on this a bit: When I said earlier that your other tables could probably be reworked, moving the year and month into a partition keyed by (category_scope, wiki) was the first thing that came to mind. So like this:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki), year, month, rank)
);

The one obstacle I see in a situation like this is cardinality. Changing it like this means that we're growing the partition from being based on the number of unique rank attributes (for each (category_scope, wiki, year, month)), to 12 * num_years * num_ranks (assuming the number of unique ranks is relatively low, that's not much). The other side of the equation is distribution. Changing it like this means we're shrinking the number of partitions from the number of unique (category_scope, wiki, year, month) tuples, to the number of unique (category_scope, wiki). If the latter is overly small, then distribution won't be as good.

Do you have some idea of the cardinality of these datasets?

If a change like this fits your data, I think the schema is all we'd need to fix; all we'd be doing is changing the way Cassandra organizes the data, so insofar as your client code is concerned, the schema should work as-is.
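As an illustration (values hypothetical), the retention delete that this layout enables is a single range tombstone, issued once per (category_scope, wiki) partition:

-- With year/month as clustering columns, expiring old data is one small
-- write per partition (hypothetical cutoff and values):
DELETE FROM commons.top_viewed_categories_monthly
 WHERE category_scope = 'deep'
   AND wiki = 'enwiki'
   AND year < 2020;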

rank's cardinality is 1000. That is, we calculate a max of 1000 rows per (category_scope, wiki, year, month). category_scope can only be one of shallow or deep, so its cardinality is 2. wiki can be any of our ~900+ wikis.

Calculating over 5 years of data:

Unique (category_scope, wiki, year, month) should be about 2 * 900 * 5 * 12 = 108,000.

Unique (category_scope, wiki) should be about 2 * 900 = 1,800.

So definitely a stark contrast for this particular table.

Couldn't we add a dummy TIMESTAMP for DELETE purposes?

That is, for all top_* tables, we add:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    internal_dt         TIMESTAMP,                 <<<<<<<<<<<<<<<<
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

Then the deletes could be like DELETE FROM top_viewed_categories_monthly WHERE internal_dt < '2020-01-01' since that column is not part of the PRIMARY KEY. Looks like that would cost us a 64-bit integer, but it seems like it would solve the issue without modifications to the PRIMARY KEY, and no TTLs would be needed either.

Actually, it could be a DATE type, and then it would only cost 32 bits.

Well, the way you have it here wouldn't work, because that value isn't indexed. If you moved it into the PRIMARY KEY (a la PRIMARY KEY ((category_scope, wiki, year, month), internal_dt, rank), for example), your query becomes possible, but it won't remove the full record. The DELETE is only going to operate against the contents of the partition.
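For illustration (hypothetical values): even with internal_dt promoted to a clustering column, the statement still has to name one fully specified partition, and that partition is already scoped to a single month, so the range on internal_dt buys nothing across partitions:

-- Valid once internal_dt is a clustering column, but it only touches the one
-- partition named in the WHERE clause, which already pins a single month.
DELETE FROM commons.top_viewed_categories_monthly
 WHERE category_scope = 'deep'
   AND wiki = 'enwiki'
   AND year = 2019
   AND month = 1
   AND internal_dt < '2020-01-01';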

Ok, so we'd have 1800 partitions (distributed over 24 nodes/dc) that are 5 * 12 * 1000 rows "wide"? 60000 per partition? That seems Ok.

An important element of this is that your queries all qualify year & month, so even if the partition runs wide and growth is unbounded, your queries aren't/don't.
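For example, the read path stays narrow because every query names a specific year and month; a sketch with hypothetical values:

-- Equality on the partition key plus year/month limits the slice to at most
-- num_ranks rows, regardless of how many months the partition accumulates.
SELECT category, pageview_count, rank
  FROM commons.top_viewed_categories_monthly
 WHERE category_scope = 'deep'
   AND wiki = 'enwiki'
   AND year = 2024
   AND month = 3;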


I guess the real question is: If we assume we can do it this way (a range DELETE instead of a TTL), what would be the strategy for implementing retention based on this? If your plan is to retain data for 5 years, then for all intents and purposes, nothing needs to happen until 5 years from now. At 5 years and 1 month, you'd want some process to kick off and remove the corresponding month's worth of data, for each of those 1800 partitions (setting aside other tables for the time being). Presumably that's some purpose-built piece of code. Do you write that now and test it 5 years from now? Create a ticket post-dated 5 years in the future? @JAllemandou is wise, perhaps they already considered all of this before suggesting a TTL. 😀

Or, maybe the existing tables that need to be culled provide the impetus to create the infrastructure for culling older records?

Super interesting that Cassandra can now do a delete via range! This could help us apply a deletion policy if we ever decide to use one!

Oh! It's great that we can do range deletions from Cassandra!
I think with the deprecation of Druid, we are adopting Cassandra as our main serving layer, and we should be able to delete data from it given a time range.
This would help us do idempotent reruns and implement deletion policies.

I think TTL is convenient, because it's easy. But I'm not sure it is the right paradigm for long-lived data.
It has some problems, like:

  • If we backfill say 1 year of data, then all that data is going to be deleted at the same time.
  • If we re-run past data, then the updated data will live longer than more recent data.
  • If data retention circumstances change during the years the data is to live (we need to delete the data sooner, or we don't need to delete it at all), we cannot adapt to them.

If possible, I would lean towards implementing a deletion pipeline instead of using TTLs. We could maybe find a way to automate that via dataset config store?

  • If we backfill say 1 year of data, then all that data is going to be deleted at the same time.
  • If we re-run past data, then the updated data will live longer than more recent data.

For me, both of these fall under the umbrella of (what I perceive to be) a lack of flexibility with your loader. I say that because Cassandra will allow you to assign the write time on INSERT/UPDATE, even if the Spark-based loader will not.
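For reference, plain CQL lets the client pin the write time explicitly (and a TTL, if desired); a minimal sketch with hypothetical values:

-- USING TIMESTAMP sets the cells' write time (microseconds since epoch), so a
-- re-run of an old month does not end up looking "newer" than recent data;
-- a TTL could optionally be pinned the same way (all values hypothetical).
INSERT INTO commons.top_viewed_categories_monthly
    (category_scope, wiki, year, month, rank, category, pageview_count)
VALUES ('deep', 'enwiki', 2023, 11, 1, 'Example category', 12345)
USING TIMESTAMP 1701388800000000 AND TTL 157680000;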

I don't know if this is something we'd ever have the time to work on, but I would love to come up with something that utilized Cassandra's bulk-loading capabilities. At a high level, your import jobs would write Cassandra SSTables (there is a high-level, CQL-based abstraction for this) and drop them into a directory where they would be picked up for bulk import. This would be incredibly fast/efficient, to the point where we could probably lift the requirement to run batch jobs during off-peak hours.

Presumably the piece that creates SSTable files from results is something we'd have to build (an opportunity to replace the existing loader), and our opportunity to support things like writetime, or to implement retention (think: a policy that, for each new row inserted, creates a range delete for everything older than {retention_period}).

I ran a notebook with some simple calculations, using this template for each query:

year_months = [
    '2023-11',
    '2023-12',
    '2024-01',
    '2024-02',
    '2024-03',
    '2024-04',
]

row_counts = []
for year_month in year_months:
    row_counts.append(spark.sql(
        f"""
        hql goes here with {year_month} substitution
        """
    ).count())

print(f'Row counts: {row_counts}')

avg_of_6_months = int(sum(row_counts)/len(row_counts))
print(f'avg of 6 months is {avg_of_6_months}')

for years in [1,3,5,10]:
    print(f'Extrapolated to {years} years: {avg_of_6_months * 2 * years}')

Here are the results for each HQL file:

load_cassandra_commons_category_metrics_snapshot.hql

        Row counts: [692604, 710374, 724316, 736466, 742635, 749294]
        avg of 6 months is 725948
        Extrapolated to 1 years: 1451896
        Extrapolated to 3 years: 4355688
        Extrapolated to 5 years: 7259480
        Extrapolated to 10 years: 14518960


load_cassandra_commons_edits_per_category_monthly.hql

        Row counts: [309134, 269443, 283016, 291633, 268675, 272541]
        avg of 6 months is 282407
        Extrapolated to 1 years: 564814
        Extrapolated to 3 years: 1694442
        Extrapolated to 5 years: 2824070
        Extrapolated to 10 years: 5648140


load_cassandra_commons_edits_per_user_monthly.hql

        Row counts: [18875, 17613, 18718, 21221, 21956, 19370]
        avg of 6 months is 19625
        Extrapolated to 1 years: 39250
        Extrapolated to 3 years: 117750
        Extrapolated to 5 years: 196250
        Extrapolated to 10 years: 392500


load_cassandra_commons_media_file_metrics_snapshot.hql

        Row counts: [4709315, 4748909, 5093376, 5044512, 5172978, 5217583]
        avg of 6 months is 4997778
        Extrapolated to 1 years: 9995556
        Extrapolated to 3 years: 29986668
        Extrapolated to 5 years: 49977780
        Extrapolated to 10 years: 99955560


load_cassandra_commons_pageviews_per_category_monthly.hql

        Row counts: [5006153, 5109651, 5352705, 5336437, 5444292, 5559445]
        avg of 6 months is 5301447
        Extrapolated to 1 years: 10602894
        Extrapolated to 3 years: 31808682
        Extrapolated to 5 years: 53014470
        Extrapolated to 10 years: 106028940


load_cassandra_commons_pageviews_per_media_file_monthly.hql

        Row counts: [16733675, 16916077, 18314231, 17941258, 18457091, 18784188]
        avg of 6 months is 17857753
        Extrapolated to 1 years: 35715506
        Extrapolated to 3 years: 107146518
        Extrapolated to 5 years: 178577530
        Extrapolated to 10 years: 357155060


load_cassandra_commons_top_edited_categories_monthly.hql

        Row counts: [3194, 3268, 3325, 3347, 3313, 3277]
        avg of 6 months is 3287
        Extrapolated to 1 years: 6574
        Extrapolated to 3 years: 19722
        Extrapolated to 5 years: 32870
        Extrapolated to 10 years: 65740


load_cassandra_commons_top_editors_monthly.hql

        Row counts: [318925, 292033, 312174, 313196, 320420, 304375]
        avg of 6 months is 310187
        Extrapolated to 1 years: 620374
        Extrapolated to 3 years: 1861122
        Extrapolated to 5 years: 3101870
        Extrapolated to 10 years: 6203740


load_cassandra_commons_top_pages_per_category_monthly.hql

        Row counts: [34974791, 35413543, 37946799, 38171738, 39525931, 41312096]
        avg of 6 months is 37890816
        Extrapolated to 1 years: 75781632
        Extrapolated to 3 years: 227344896
        Extrapolated to 5 years: 378908160
        Extrapolated to 10 years: 757816320


load_cassandra_commons_top_pages_per_media_file_monthly.hql

        Row counts: [34840815, 35214849, 38785148, 38879803, 40378732, 42142899]
        avg of 6 months is 38373707
        Extrapolated to 1 years: 76747414
        Extrapolated to 3 years: 230242242
        Extrapolated to 5 years: 383737070
        Extrapolated to 10 years: 767474140


load_cassandra_commons_top_viewed_categories_monthly.hql

        Row counts: [488551, 490853, 503195, 499155, 512470, 522686]
        avg of 6 months is 502818
        Extrapolated to 1 years: 1005636
        Extrapolated to 3 years: 3016908
        Extrapolated to 5 years: 5028180
        Extrapolated to 10 years: 10056360


load_cassandra_commons_top_viewed_media_files_monthly.hql

        Row counts: [31602943, 32007472, 33760839, 33350496, 34096329, 34792779]
        avg of 6 months is 33268476
        Extrapolated to 1 years: 66536952
        Extrapolated to 3 years: 199610856
        Extrapolated to 5 years: 332684760
        Extrapolated to 10 years: 665369520


load_cassandra_commons_top_wikis_per_category_monthly.hql

        Row counts: [4565435, 4656221, 4891394, 4867686, 4972388, 5083295]
        avg of 6 months is 4839403
        Extrapolated to 1 years: 9678806
        Extrapolated to 3 years: 29036418
        Extrapolated to 5 years: 48394030
        Extrapolated to 10 years: 96788060


load_cassandra_commons_top_wikis_per_media_file_monthly.hql

        Row counts: [12023300, 12166315, 13219980, 12895879, 13283237, 13565730]
        avg of 6 months is 12859073
        Extrapolated to 1 years: 25718146
        Extrapolated to 3 years: 77154438
        Extrapolated to 5 years: 128590730
        Extrapolated to 10 years: 257181460

We experimented with lowering the top_* tables from having at most 1000 rows per metric, to at most 100. Here are the results:

load_cassandra_commons_top_edited_categories_monthly.hql

Top 1000 rows per metric:
Row counts: [3194, 3268, 3325, 3347, 3313, 3277]
avg of 6 months is 3287
Extrapolated to 1 years: 6574
Extrapolated to 3 years: 19722
Extrapolated to 5 years: 32870
Extrapolated to 10 years: 65740

Top 100 rows per metric:
Row counts: [400, 400, 400, 400, 400, 400]
avg of 6 months is 400
Extrapolated to 1 years: 800
Extrapolated to 3 years: 2400
Extrapolated to 5 years: 4000
Extrapolated to 10 years: 8000

Percentage of original size: 12%

load_cassandra_commons_top_editors_monthly.hql

Top 1000 rows per metric:
Row counts: [318925, 292033, 312174, 313196, 320420, 304375]
avg of 6 months is 310187
Extrapolated to 1 years: 620374
Extrapolated to 3 years: 1861122
Extrapolated to 5 years: 3101870
Extrapolated to 10 years: 6203740

Top 100 rows per metric:
Row counts: [291713, 266949, 285735, 284058, 288400, 277710]
avg of 6 months is 282427
Extrapolated to 1 years: 564854
Extrapolated to 3 years: 1694562
Extrapolated to 5 years: 2824270
Extrapolated to 10 years: 5648540

Percentage of original size: 91%

load_cassandra_commons_top_pages_per_category_monthly.hql

Top 1000 rows per metric:
Row counts: [34974791, 35413543, 37946799, 38171738, 39525931, 41312096]
avg of 6 months is 37890816
Extrapolated to 1 years: 75781632
Extrapolated to 3 years: 227344896
Extrapolated to 5 years: 378908160
Extrapolated to 10 years: 757816320

Top 100 rows per metric:
Row counts: [22438356, 22801039, 24014958, 24118421, 24735507, 25429991]
avg of 6 months is 23923045
Extrapolated to 1 years: 47846090
Extrapolated to 3 years: 143538270
Extrapolated to 5 years: 239230450
Extrapolated to 10 years: 478460900

Percentage of original size: 63%

load_cassandra_commons_top_pages_per_media_file_monthly.hql

Top 1000 rows per metric:
Row counts: [34840815, 35214849, 38785148, 38879803, 40378732, 42142899]
avg of 6 months is 38373707
Extrapolated to 1 years: 76747414
Extrapolated to 3 years: 230242242
Extrapolated to 5 years: 383737070
Extrapolated to 10 years: 767474140

Top 100 rows per metric:
Row counts: [31705607, 32066656, 34723326, 34306352, 35381242, 36337743]
avg of 6 months is 34086821
Extrapolated to 1 years: 68173642
Extrapolated to 3 years: 204520926
Extrapolated to 5 years: 340868210
Extrapolated to 10 years: 681736420

Percentage of original size: 89%

load_cassandra_commons_top_viewed_categories_monthly.hql

Top 1000 rows per metric:
Row counts: [488551, 490853, 503195, 499155, 512470, 522686]
avg of 6 months is 502818
Extrapolated to 1 years: 1005636
Extrapolated to 3 years: 3016908
Extrapolated to 5 years: 5028180
Extrapolated to 10 years: 10056360

Top 100 rows per metric:
Row counts: [112858, 112765, 115443, 114384, 116975, 118395]
avg of 6 months is 115136
Extrapolated to 1 years: 230272
Extrapolated to 3 years: 690816
Extrapolated to 5 years: 1151360
Extrapolated to 10 years: 2302720

Percentage of original size: 23%

load_cassandra_commons_top_viewed_media_files_monthly.hql

Top 1000 rows per metric:
Row counts: [31602943, 32007472, 33760839, 33350496, 34096329, 34792779]
avg of 6 months is 33268476
Extrapolated to 1 years: 66536952
Extrapolated to 3 years: 199610856
Extrapolated to 5 years: 332684760
Extrapolated to 10 years: 665369520

Top 100 rows per metric:
Row counts: [21145918, 21492724, 22515085, 22434407, 22864253, 23289161]
avg of 6 months is 22290258
Extrapolated to 1 years: 44580516
Extrapolated to 3 years: 133741548
Extrapolated to 5 years: 222902580
Extrapolated to 10 years: 445805160

Percentage of original size: 67%

load_cassandra_commons_top_wikis_per_category_monthly.hql

Top 1000 rows per metric:
Row counts: [4565435, 4656221, 4891394, 4867686, 4972388, 5083295]
avg of 6 months is 4839403
Extrapolated to 1 years: 9678806
Extrapolated to 3 years: 29036418
Extrapolated to 5 years: 48394030
Extrapolated to 10 years: 96788060

Top 100 rows per metric:
Row counts: [4411351, 4501125, 4727575, 4703233, 4798316, 4897946]
avg of 6 months is 4673257
Extrapolated to 1 years: 9346514
Extrapolated to 3 years: 28039542
Extrapolated to 5 years: 46732570
Extrapolated to 10 years: 93465140

Percentage of original size: 97%

load_cassandra_commons_top_wikis_per_media_file_monthly.hql

Top 1000 rows per metric:
Row counts: [12023300, 12166315, 13219980, 12895879, 13283237, 13565730]
avg of 6 months is 12859073
Extrapolated to 1 years: 25718146
Extrapolated to 3 years: 77154438
Extrapolated to 5 years: 128590730
Extrapolated to 10 years: 257181460

Top 100 rows per metric:
Row counts: [12002961, 12145673, 13198348, 12871148, 13256429, 13532457]
avg of 6 months is 12834502
Extrapolated to 1 years: 25669004
Extrapolated to 3 years: 77007012
Extrapolated to 5 years: 128345020
Extrapolated to 10 years: 256690040

Percentage of original size: 100%

Analysis:

  • The tables that tended to benefit greatly from the reduction were also the ones with the fewest rows to begin with.
  • Over 5 years we'd have a total of 1,280,477,670 rows vs 982,058,460, yielding ~77% of the original rows.

Conclusions: We are better off deleting data than reducing from top 1000 to top 100; however, even if we did nothing else, this change would still save ~300M rows over 5 years. So I will go ahead with the change to tame the row explosion a bit.

Change #1037167 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] Commons Impact Metrics: Reduce the top calculations from top 1000 to top 100.

https://gerrit.wikimedia.org/r/1037167

Change #1037167 merged by Xcollazo:

[analytics/refinery@master] Commons Impact Metrics: Reduce the top calculations from top 1000 to top 100.

https://gerrit.wikimedia.org/r/1037167

This comment was removed by xcollazo.

@Eevans regarding us having a deletion mechanism:

Going back to our running example:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

Since we have the original tables in the Datalake, we can calculate the MIN() and MAX() of all the columns that are part of the PRIMARY KEY:

column            MIN()     MAX()
category_scope    deep      shallow
wiki              aawiki    zzwiki
year              2024      2024
month             0         12

So, if we put together a DELETE statement like the following:

DELETE
FROM commons.top_viewed_categories_monthly
 WHERE category_scope >= 'deep'
   AND category_scope <= 'shallow'
   AND wiki >= 'aawiki'
   AND wiki <= 'zzwiki'
   AND year >= 2024
   AND year <= 2024
   AND month >= 0
   AND month <= 12

Would this make Cassandra happy, and hopefully be fast?

No, this won't work. Everything inside those inner parens (e.g. (category_scope, wiki, year, month)) is combined to form the partition key. You can't do anything but a by-value lookup of that partition key, which in turn requires specifying each constituent value (i.e. no ranges).
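With this key layout, the only delete Cassandra will accept names every partition key component exactly, which removes one whole partition (all ranks for a given scope/wiki/month) at a time; a sketch with hypothetical values:

-- Valid against the original key: each partition key component given by value.
-- A retention job would have to enumerate every such partition it wants gone.
DELETE FROM commons.top_viewed_categories_monthly
 WHERE category_scope = 'deep'
   AND wiki = 'enwiki'
   AND year = 2024
   AND month = 3;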

Ok, so we'd have 1800 partitions (distributed over 24 nodes/dc) that are 5 * 12 * 1000 rows "wide"? 60000 per partition? That seems Ok.

An important element of this is that your queries all qualify year & month, so even if the partition runs wide and growth is unbounded, your queries aren't/don't.


I guess the real question is: If we assume we can do it this way (a range DELETE instead of a TTL), what would be the strategy for implementing retention based on this? If your plan is to retain data for 5 years, then for all intents and purposes, nothing needs to happen until 5 years from now. At 5 years and 1 month, you'd want some process to kick off and remove the corresponding month's worth of data, for each of those 1800 partitions (setting aside other tables for the time being). Presumably that's some purpose-built piece of code. Do you write that now and test it 5 years from now? Create a ticket post-dated 5 years in the future? @JAllemandou is wise, perhaps they already considered all of this before suggesting a TTL. 😀

Or, maybe the existing tables that need to be culled provide the impetus to create the infrastructure for culling older records?

Going back to this comment: It seems like it'd be OK (at least in this particular example you provided numbers for), to move year & month out of the partition key (so that they'd be indexed for range queries). I say 'OK' only because the distribution isn't as good (we'd have fewer, wider partitions), but the dimensions are still within reason. If it provided the means to implement more flexible retention, I'd call that a win. That's really the only concern I think, implementing that retention.

The way the schema is right now, the dataset can grow unbounded, yes, but other than the storage resources used, it presents no concern. If we changed it and moved year and month out of the partition key, then partitions would grow unbounded. It would probably take many years (many more than 5), but on a long enough timeline, that will eventually cause problems. So my (relatively minor) concern would be kicking the can so far down the road that nothing is ever done, and that one day our successors look back on us with contempt. 😀 In other words, if we do this, we should make a reasonable effort to actually implement the deletes. I think we have some more immediate motivation to do so too: @JAllemandou has indicated a desire to cull records from pageviews, and the schema there already permits this. Anything we created could (nay should) be generic enough to accommodate all of these.

Fair enough.

I think from the point of view of this task though, we are in agreement that:

  • TTLs are not the best mechanism to deal with this.
  • There is no immediate concern, even considering the row numbers shared on T364583#9843977.

Thus, I am closing this task, and opening a separate one to deal with a more generic solution to DELETE data from Cassandra.

Opened T366631.

Reopening to consider changing PRIMARY KEYs on the top_* tables, given @Eevans' comment on T366631:

Just for posterity sake:

Some of the Commons Impact Metrics tables already accommodate doing range deletes. Those that do not could be made to do so by moving year and month attributes into the composite key. This would make for fewer —wider— partitions, but not unacceptably so (at least based on the example discussed in T364583). Using range deletes is cheap, because you're writing a single tombstone to remove an entire range of values. As far as I can tell, this change to schema could be made without having to update any of the affected code (i.e. it changes nothing, except for how Cassandra organizes the data). It is not a change that can be made once data is in place though. In other words, now seems like an ideal time; Doing this later would require migration.

Point deletes will work too, though I suspect the tooling will end up being more bespoke, than something range based. It's also less efficient/elegant; It means a doubling of the number of transactions (however many are inserted in a day, would be the number of deletes to remove everything 5 years + 1 days old).

Here are the discrete cardinalities of the columns that are part of the PRIMARY KEYs of the top_* tables from Commons Impact Metrics:

cardinality(category) = 749294, will grow
cardinality(category_scope) = 2
cardinality(wiki) = ~900, will grow slowly
cardinality(year) = ~5
cardinality(month) = 12
cardinality(rank) = 100
cardinality(media_file) = 5217583, will grow
cardinality(edit_type) = 3

Here is a table with the suggested PRIMARY KEY changes and their respective full PRIMARY KEY cardinalities:

commons.top_pages_per_category_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, wiki, year, month) = 749294*2*900*5*12 = 80,923,752,000
    New PRIMARY KEY cardinality: cardinality(category, category_scope, wiki) = 749294*2*900 = 1,348,729,200

commons.top_wikis_per_category_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, year, month) = 749294*2*5*12 = 89,915,280
    New PRIMARY KEY cardinality: cardinality(category, category_scope) = 749294*2 = 1,498,588

commons.top_viewed_categories_monthly
    Old PRIMARY KEY cardinality: cardinality(category_scope, wiki, year, month) = 2*900*5*12 = 108,000
    New PRIMARY KEY cardinality: cardinality(category_scope, wiki) = 2*900 = 1,800

commons.top_pages_per_media_file_monthly
    Old PRIMARY KEY cardinality: cardinality(media_file, wiki, year, month) = 5217583*900*5*12 = 281,749,482,000
    New PRIMARY KEY cardinality: cardinality(media_file, wiki) = 5217583*900 = 4,695,824,700

commons.top_wikis_per_media_file_monthly
    Old PRIMARY KEY cardinality: cardinality(media_file, year, month) = 5217583*5*12 = 313,054,980
    New PRIMARY KEY cardinality: cardinality(media_file) = 5217583

commons.top_viewed_media_files_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, wiki, year, month) = 749294*2*900*5*12 = 80,923,752,000
    New PRIMARY KEY cardinality: cardinality(category, category_scope, wiki) = 749294*2*900 = 1,348,729,200

commons.top_edited_categories_monthly
    Old PRIMARY KEY cardinality: cardinality(category_scope, edit_type, year, month) = 2*3*5*12 = 360
    New PRIMARY KEY cardinality: cardinality(category_scope, edit_type) = 2*3 = 6

commons.top_editors_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, edit_type, year, month) = 749294*2*3*5*12 = 269,745,840
    New PRIMARY KEY cardinality: cardinality(category, category_scope, edit_type) = 749294*2*3 = 4,495,764

Here is how the new DDL would look for all these top_* tables:

CREATE TABLE commons.top_pages_per_category_monthly (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    page_wiki           VARCHAR,
    page_title          VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, wiki), year, month, rank)
);

CREATE TABLE commons.top_wikis_per_category_monthly (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope), year, month, rank)
);

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki), year, month, rank)
);

CREATE TABLE commons.top_pages_per_media_file_monthly (
    media_file          VARCHAR,
    wiki                VARCHAR,
    page_wiki           VARCHAR,
    page_title          VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((media_file, wiki), year, month, rank)
);

CREATE TABLE commons.top_wikis_per_media_file_monthly (
    media_file          VARCHAR,
    wiki                VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((media_file), year, month, rank)
);

CREATE TABLE commons.top_viewed_media_files_monthly (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    media_file          VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, wiki), year, month, rank)
);

CREATE TABLE commons.top_edited_categories_monthly (
    category_scope      VARCHAR,
    edit_type           VARCHAR,
    category            VARCHAR,
    edit_count          BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, edit_type), year, month, rank)
);

CREATE TABLE commons.top_editors_monthly (
    category            VARCHAR,
    category_scope      VARCHAR,
    edit_type           VARCHAR,
    user_name           VARCHAR,
    edit_count          BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, edit_type), year, month, rank)
);
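With these layouts in place, a retention pass boils down to one small range tombstone per partition; a sketch with hypothetical values (here, one statement per media_file):

-- Hypothetical retention delete against the new layout: the partition key is
-- named exactly, and old months are trimmed via the year clustering column.
DELETE FROM commons.top_wikis_per_media_file_monthly
 WHERE media_file = 'Example_file.jpg'
   AND year < 2020;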

@Eevans please take a look at the cardinality numbers on T364583#9864290, and if you think they look ok, let's go for it and change them now rather than later, as you suggest.

Could you update that table to include the number of columns per partition, in each?

I think you meant rows? In that case, we have about 6,000 rows per partition:

commons.top_pages_per_category_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, wiki, year, month) = 749294*2*900*5*12 = 80,923,752,000
    New PRIMARY KEY cardinality: cardinality(category, category_scope, wiki) = 749294*2*900 = 1,348,729,200
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_wikis_per_category_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, year, month) = 749294*2*5*12 = 89,915,280
    New PRIMARY KEY cardinality: cardinality(category, category_scope) = 749294*2 = 1,498,588
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_viewed_categories_monthly
    Old PRIMARY KEY cardinality: cardinality(category_scope, wiki, year, month) = 2*900*5*12 = 108,000
    New PRIMARY KEY cardinality: cardinality(category_scope, wiki) = 2*900 = 1,800
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_pages_per_media_file_monthly
    Old PRIMARY KEY cardinality: cardinality(media_file, wiki, year, month) = 5217583*900*5*12 = 281,749,482,000
    New PRIMARY KEY cardinality: cardinality(media_file, wiki) = 5217583*900 = 4,695,824,700
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_wikis_per_media_file_monthly
    Old PRIMARY KEY cardinality: cardinality(media_file, year, month) = 5217583*5*12 = 313,054,980
    New PRIMARY KEY cardinality: cardinality(media_file) = 5217583
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_viewed_media_files_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, wiki, year, month) = 749294*2*900*5*12 = 80,923,752,000
    New PRIMARY KEY cardinality: cardinality(category, category_scope, wiki) = 749294*2*900 = 1,348,729,200
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_edited_categories_monthly
    Old PRIMARY KEY cardinality: cardinality(category_scope, edit_type, year, month) = 2*3*5*12 = 360
    New PRIMARY KEY cardinality: cardinality(category_scope, edit_type) = 2*3 = 6
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_editors_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, edit_type, year, month) = 749294*2*3*5*12 = 269,745,840
    New PRIMARY KEY cardinality: cardinality(category, category_scope, edit_type) = 749294*2*3 = 4,495,764
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

These are totally fine, we should definitely do this.

Are you OK to have the existing tables removed (along with any data you've already loaded), and recreated? For both clusters, aqs and cassandra-dev?

Yes and yes.

Ok, the schema has been updated on both clusters (any data already loaded will need to be reloaded).

xcollazo renamed this task from "Consider whether we want to set a TTL on the Cassandra tables" to "Consider what mechanism to use to make date deletion possible". Thu, Jun 6, 4:24 PM
xcollazo updated the task description. (Show Details)

I'm having trouble testing data loading on the Cassandra Staging cluster. Looks like we need to redefine aqsloader permissions on the re-created tables:

...
	Suppressed: java.io.IOException: Failed to write statements to commons.top_edited_categories_monthly. The
latest exception was
  User aqsloader has no MODIFY permission on <table commons.top_edited_categories_monthly> or any of its parents

Please check the executor logs for more exceptions and information
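Presumably the fix is a grant along these lines (a sketch; the exact statement the admins run may differ):

-- Hypothetical: re-grant write permission to the loader role after the tables
-- were dropped and recreated (a keyspace-level grant also covers new tables).
GRANT MODIFY ON KEYSPACE commons TO aqsloader;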

Done.

Confirmed data loading works fine with the new schema. Thank you @Eevans for all the help and improvements!

Thank you @Eevans and @xcollazo for handling this with such precision! I'm sorry I've been out these last 2 weeks.