
Consider what mechanism to use to make date deletion possible
Closed, Resolved (Public)

Description

In conversations with @JAllemandou, we have learned that keeping an API like the Pageviews API running over the course of many years has taught us some lessons about data retention. Mainly, keeping the data available forever has caused Cassandra performance issues. Attempting a data deletion now, on a database of that size, is impractical, and TTLs are applied per record at write time, so they cannot retroactively expire data that has already been loaded.

Thus we should consider setting TTLs for all tables of Commons Impact Metrics before we launch.
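For reference, a table-level default TTL is set via the default_time_to_live table property (in seconds); it can be changed later with ALTER TABLE, but since TTLs are stamped onto records at write time, a change only affects data written afterwards. A minimal sketch, with a hypothetical table and a hypothetical 5-year duration:

-- Hypothetical sketch: table-level default TTL of 5 years (157680000 seconds).
-- The TTL is applied to each record at write time, so changing it later does
-- not retroactively expire rows that are already stored.
CREATE TABLE commons.example_with_default_ttl (
    entity  VARCHAR,
    dt      TIMESTAMP,
    value   BIGINT,
    PRIMARY KEY ((entity), dt)
) WITH default_time_to_live = 157680000;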

In this ticket we should:

  • Wait until we have a final allow-list. Do a full run of all the ETL, all the way to Cassandra, and see how many rows and how much data we get for a couple of backfilled months.
  • Extrapolate what the size in Cassandra would be for all our 14 tables over 1, 3, 5 years.
  • With this info, consider whether we'd need to set TTLs, and if so, at what duration.
  • If no TTL, consider what to do long term with the data: should we delete it? If so, when should that happen? After 1, 3, 5 years? Long term we want T366631.

Details

Event Timeline

@JAllemandou could you elaborate on the performance issues?

One thing worth considering: since TTLs are recorded at write time, on a per-record basis, changing your retention policy later becomes very costly (probably prohibitively so).

An alternative: Everything in the post-Cassandra 3.x era supports range tombstones, meaning a DELETE FROM ... WHERE val < ? is possible, and very efficient. One tiny write, and Cassandra will (eventually) GC everything that is less-than, as part of normal compaction. This should work for any time-series datasets, like pageviews, and several of the Commons Impact Metrics tables as they exist today (commons.category_metrics_snapshot for example). Other CIM tables (like commons.top_viewed_categories_monthly) could probably be re-thought to make this possible as well.
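To make that concrete, here is a sketch in CQL (the table and values are hypothetical, just a generic time-series layout): the time component lives in the clustering key, and a single statement writes one range tombstone for the targeted partition:

-- Hypothetical time-series layout: partition on the entity, cluster on time.
CREATE TABLE commons.example_metrics_by_day (
    entity  VARCHAR,
    day     DATE,
    value   BIGINT,
    PRIMARY KEY ((entity), day)
);

-- One cheap range tombstone; the partition key still has to be specified,
-- so a retention job would issue one such statement per partition.
DELETE FROM commons.example_metrics_by_day
 WHERE entity = 'some-entity'
   AND day < '2020-01-01';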

This should work for any time-series datasets
...
Other CIM tables (like commons.top_viewed_categories_monthly) could probably be re-thought to make this possible as well.

All top_* tables have rows that belong to a particular year and month. Example:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

So wouldn't the tombstone thingy work similarly? As in:

DELETE FROM top_viewed_categories_monthly WHERE year < 2020

Semantically, that DELETE is ok with our data model.

No, because year & month are a part of the partition key; the category_scope, wiki, year, and month attributes are for all intents and purposes concatenated together. A range delete like this is something that has to be done against the (indexed) partition itself.

To expand on this a bit: When I said earlier that your other tables could probably be reworked, moving the year and month into a partition keyed by (category_scope, wiki) was the first thing that came to mind. So like this:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki), year, month, rank)
);

The one obstacle I see in a situation like this is cardinality. Changing it like this means that we're growing the partition from being based on the number of unique rank attributes (for each (category_scope, wiki, year, month)), to 12 * num_years * num_ranks (assuming the number of unique ranks is relatively low, that's not much). The other side of the equation is distribution. Changing it like this means we're shrinking the number of partitions from the number of unique (category_scope, wiki, year, month) tuples, to the number of unique (category_scope, wiki). If the latter is overly small, then distribution won't be as good.

Do you have some idea of the cardinality of these datasets?

If a change like this fits your data, I think the schema is all we'd need to fix; all we'd be doing is changing the way Cassandra organizes the data, so insofar as your client code is concerned, the schema should work as-is.
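As an illustration (values hypothetical), the retention delete that this layout enables is a single range tombstone, issued once per (category_scope, wiki) partition:

-- With year/month as clustering columns, expiring old data is one small
-- write per partition (hypothetical cutoff and values):
DELETE FROM commons.top_viewed_categories_monthly
 WHERE category_scope = 'deep'
   AND wiki = 'enwiki'
   AND year < 2020;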

rank's cardinality is 1000. That is, we calculate a max of 1000 rows per (category_scope, wiki, year, month). category_scope can only be one of shallow or deep, so its cardinality is 2. wiki can be any of our ~900+ wikis.

Calculating over 5 years of data:

Unique (category_scope, wiki, year, month) should be about 2 * 900 * 5 * 12 = 108,000.

Unique (category_scope, wiki) should be about 2 * 900 = 1,800.

So definitely a stark contrast for this particular table.

Couldn't we add a dummy TIMESTAMP for DELETE purposes?

That is, for all top_* tables, we add:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    internal_dt         TIMESTAMP,                 <<<<<<<<<<<<<<<<
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

Then the deletes could be like DELETE FROM top_viewed_categories_monthly WHERE internal_dt < '2020-01-01' since that column is not part of the PRIMARY KEY. Looks like that would cost us a 64-bit integer, but it seems like it would solve the issue without modifications to the PRIMARY KEY, and no TTLs would be needed either.

Actually, it could be a DATE type, and then it would only cost 32 bits.

Well, the way you have it here wouldn't work, because that value isn't indexed. If you moved it into the PRIMARY KEY (a la PRIMARY KEY ((category_scope, wiki, year, month), internal_dt, rank), for example), your query becomes possible, but it won't remove the full record. The DELETE is only going to operate against the contents of the partition.
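For illustration (hypothetical values): even with internal_dt promoted to a clustering column, the statement still has to name one fully specified partition, and that partition is already scoped to a single month, so the range on internal_dt buys nothing across partitions:

-- Valid once internal_dt is a clustering column, but it only touches the one
-- partition named in the WHERE clause, which already pins a single month.
DELETE FROM commons.top_viewed_categories_monthly
 WHERE category_scope = 'deep'
   AND wiki = 'enwiki'
   AND year = 2019
   AND month = 1
   AND internal_dt < '2020-01-01';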

Ok, so we'd have 1800 partitions (distributed over 24 nodes/dc) that are 5 * 12 * 1000 rows "wide"? 60000 per partition? That seems Ok.

An important element of this is that your queries all qualify year & month, so even if the partition runs wide and growth is unbounded, your queries aren't/don't.
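For example, the read path stays narrow because every query names a specific year and month; a sketch with hypothetical values:

-- Equality on the partition key plus year/month limits the slice to at most
-- num_ranks rows, regardless of how many months the partition accumulates.
SELECT category, pageview_count, rank
  FROM commons.top_viewed_categories_monthly
 WHERE category_scope = 'deep'
   AND wiki = 'enwiki'
   AND year = 2024
   AND month = 3;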


I guess the real question is: If we assume we can do it this way (a range DELETE instead of a TTL), what would be the strategy for implementing retention based on this? If your plan is to retain data for 5 years, then for all intents and purposes, nothing needs to happen until 5 years from now. At 5 years and 1 month, you'd want some process to kick off and remove the corresponding month's worth of data, for each of those 1800 partitions (setting aside other tables for the time being). Presumably that's some purpose-built piece of code. Do you write that now and test it 5 years from now? Create a ticket post-dated 5 years in the future? @JAllemandou is wise, perhaps they already considered all of this before suggesting a TTL. 😀

Or, maybe the existing tables that need to be culled provide the impetus to create the infrastructure for culling older records?

Super interesting that Cassandra can now do a delete via range! This could help us apply a deletion policy if we ever decide to use one!

Oh! It's great that we can do range deletions from Cassandra!
I think with the deprecation of Druid, we are adopting Cassandra as our main serving layer, and we should be able to delete data from it given a time range.
This would help us do idempotent reruns and implement deletion policies.

I think TTL is convenient, because it's easy. But I'm not sure it is the right paradigm for long-lived data.
It has some problems, like:

  • If we backfill say 1 year of data, then all that data is going to be deleted at the same time.
  • If we re-run past data, then the updated data will live longer than more recent data.
  • If data retention circumstances change during the years the data is to live (we need to delete the data sooner, or we don't need to delete it at all), we cannot adapt to them.

If possible, I would lean towards implementing a deletion pipeline instead of using TTLs. We could maybe find a way to automate that via dataset config store?

  • If we backfill say 1 year of data, then all that data is going to be deleted at the same time.
  • If we re-run past data, then the updated data will live longer than more recent data.

For me, both of these fall under the umbrella of (what I perceive to be) a lack of flexibility with your loader. I say that because Cassandra will allow you to assign the write time on INSERT/UPDATE, even if the Spark-based loader will not.
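For reference, plain CQL lets the client pin the write time explicitly (and a TTL, if desired); a minimal sketch with hypothetical values:

-- USING TIMESTAMP sets the cells' write time (microseconds since epoch), so a
-- re-run of an old month does not end up looking "newer" than recent data;
-- a TTL could optionally be pinned the same way (all values hypothetical).
INSERT INTO commons.top_viewed_categories_monthly
    (category_scope, wiki, year, month, rank, category, pageview_count)
VALUES ('deep', 'enwiki', 2023, 11, 1, 'Example category', 12345)
USING TIMESTAMP 1701388800000000 AND TTL 157680000;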

I don't know if this is something we'd ever have the time to work on, but I would love to come up with something that utilized Cassandra's bulk-loading capabilities. At a high level, your import jobs would write Cassandra SSTables (there is a high-level, CQL-based abstraction for this) and drop them into a directory where they would be picked up for bulk import. This would be incredibly fast/efficient, to the point where we could probably lift the requirement to run batch jobs during off-peak hours.

Presumably the piece that creates SSTable files from results is something we'd have to build (an opportunity to replace the existing loader), and our opportunity to support things like writetime, or to implement retention (think: a policy that, for each new row inserted, creates a range delete for everything older than {retention_period}).

I ran a notebook with some simple calculations, using this template for each query:

year_months = [
    '2023-11',
    '2023-12',
    '2024-01',
    '2024-02',
    '2024-03',
    '2024-04',
]

row_counts = []
for year_month in year_months:
    row_counts.append(spark.sql(
        f"""
        hql goes here with {year_month} substitution
        """
    ).count())

print(f'Row counts: {row_counts}')

avg_of_6_months = int(sum(row_counts)/len(row_counts))
print(f'avg of 6 months is {avg_of_6_months}')

for years in [1,3,5,10]:
    print(f'Extrapolated to {years} years: {avg_of_6_months * 2 * years}')

Here are the results for each HQL file:

load_cassandra_commons_category_metrics_snapshot.hql

        Row counts: [692604, 710374, 724316, 736466, 742635, 749294]
        avg of 6 months is 725948
        Extrapolated to 1 years: 1451896
        Extrapolated to 3 years: 4355688
        Extrapolated to 5 years: 7259480
        Extrapolated to 10 years: 14518960


load_cassandra_commons_edits_per_category_monthly.hql

        Row counts: [309134, 269443, 283016, 291633, 268675, 272541]
        avg of 6 months is 282407
        Extrapolated to 1 years: 564814
        Extrapolated to 3 years: 1694442
        Extrapolated to 5 years: 2824070
        Extrapolated to 10 years: 5648140


load_cassandra_commons_edits_per_user_monthly.hql

        Row counts: [18875, 17613, 18718, 21221, 21956, 19370]
        avg of 6 months is 19625
        Extrapolated to 1 years: 39250
        Extrapolated to 3 years: 117750
        Extrapolated to 5 years: 196250
        Extrapolated to 10 years: 392500


load_cassandra_commons_media_file_metrics_snapshot.hql

        Row counts: [4709315, 4748909, 5093376, 5044512, 5172978, 5217583]
        avg of 6 months is 4997778
        Extrapolated to 1 years: 9995556
        Extrapolated to 3 years: 29986668
        Extrapolated to 5 years: 49977780
        Extrapolated to 10 years: 99955560


load_cassandra_commons_pageviews_per_category_monthly.hql

        Row counts: [5006153, 5109651, 5352705, 5336437, 5444292, 5559445]
        avg of 6 months is 5301447
        Extrapolated to 1 years: 10602894
        Extrapolated to 3 years: 31808682
        Extrapolated to 5 years: 53014470
        Extrapolated to 10 years: 106028940


load_cassandra_commons_pageviews_per_media_file_monthly.hql

        Row counts: [16733675, 16916077, 18314231, 17941258, 18457091, 18784188]
        avg of 6 months is 17857753
        Extrapolated to 1 years: 35715506
        Extrapolated to 3 years: 107146518
        Extrapolated to 5 years: 178577530
        Extrapolated to 10 years: 357155060


load_cassandra_commons_top_edited_categories_monthly.hql

        Row counts: [3194, 3268, 3325, 3347, 3313, 3277]
        avg of 6 months is 3287
        Extrapolated to 1 years: 6574
        Extrapolated to 3 years: 19722
        Extrapolated to 5 years: 32870
        Extrapolated to 10 years: 65740


load_cassandra_commons_top_editors_monthly.hql

        Row counts: [318925, 292033, 312174, 313196, 320420, 304375]
        avg of 6 months is 310187
        Extrapolated to 1 years: 620374
        Extrapolated to 3 years: 1861122
        Extrapolated to 5 years: 3101870
        Extrapolated to 10 years: 6203740


load_cassandra_commons_top_pages_per_category_monthly.hql

        Row counts: [34974791, 35413543, 37946799, 38171738, 39525931, 41312096]
        avg of 6 months is 37890816
        Extrapolated to 1 years: 75781632
        Extrapolated to 3 years: 227344896
        Extrapolated to 5 years: 378908160
        Extrapolated to 10 years: 757816320


load_cassandra_commons_top_pages_per_media_file_monthly.hql

        Row counts: [34840815, 35214849, 38785148, 38879803, 40378732, 42142899]
        avg of 6 months is 38373707
        Extrapolated to 1 years: 76747414
        Extrapolated to 3 years: 230242242
        Extrapolated to 5 years: 383737070
        Extrapolated to 10 years: 767474140


load_cassandra_commons_top_viewed_categories_monthly.hql

        Row counts: [488551, 490853, 503195, 499155, 512470, 522686]
        avg of 6 months is 502818
        Extrapolated to 1 years: 1005636
        Extrapolated to 3 years: 3016908
        Extrapolated to 5 years: 5028180
        Extrapolated to 10 years: 10056360


load_cassandra_commons_top_viewed_media_files_monthly.hql

        Row counts: [31602943, 32007472, 33760839, 33350496, 34096329, 34792779]
        avg of 6 months is 33268476
        Extrapolated to 1 years: 66536952
        Extrapolated to 3 years: 199610856
        Extrapolated to 5 years: 332684760
        Extrapolated to 10 years: 665369520


load_cassandra_commons_top_wikis_per_category_monthly.hql

        Row counts: [4565435, 4656221, 4891394, 4867686, 4972388, 5083295]
        avg of 6 months is 4839403
        Extrapolated to 1 years: 9678806
        Extrapolated to 3 years: 29036418
        Extrapolated to 5 years: 48394030
        Extrapolated to 10 years: 96788060


load_cassandra_commons_top_wikis_per_media_file_monthly.hql

        Row counts: [12023300, 12166315, 13219980, 12895879, 13283237, 13565730]
        avg of 6 months is 12859073
        Extrapolated to 1 years: 25718146
        Extrapolated to 3 years: 77154438
        Extrapolated to 5 years: 128590730
        Extrapolated to 10 years: 257181460

We experimented with lowering the top_* tables from having at most 1000 rows per metric, to at most 100. Here are the results:

load_cassandra_commons_top_edited_categories_monthly.hql

Top 1000 rows per metric:
Row counts: [3194, 3268, 3325, 3347, 3313, 3277]
avg of 6 months is 3287
Extrapolated to 1 years: 6574
Extrapolated to 3 years: 19722
Extrapolated to 5 years: 32870
Extrapolated to 10 years: 65740

Top 100 rows per metric:
Row counts: [400, 400, 400, 400, 400, 400]
avg of 6 months is 400
Extrapolated to 1 years: 800
Extrapolated to 3 years: 2400
Extrapolated to 5 years: 4000
Extrapolated to 10 years: 8000

Percentage of original size: 12%

load_cassandra_commons_top_editors_monthly.hql

Top 1000 rows per metric:
Row counts: [318925, 292033, 312174, 313196, 320420, 304375]
avg of 6 months is 310187
Extrapolated to 1 years: 620374
Extrapolated to 3 years: 1861122
Extrapolated to 5 years: 3101870
Extrapolated to 10 years: 6203740

Top 100 rows per metric:
Row counts: [291713, 266949, 285735, 284058, 288400, 277710]
avg of 6 months is 282427
Extrapolated to 1 years: 564854
Extrapolated to 3 years: 1694562
Extrapolated to 5 years: 2824270
Extrapolated to 10 years: 5648540

Percentage of original size: 91%

load_cassandra_commons_top_pages_per_category_monthly.hql

Top 1000 rows per metric:
Row counts: [34974791, 35413543, 37946799, 38171738, 39525931, 41312096]
avg of 6 months is 37890816
Extrapolated to 1 years: 75781632
Extrapolated to 3 years: 227344896
Extrapolated to 5 years: 378908160
Extrapolated to 10 years: 757816320

Top 100 rows per metric:
Row counts: [22438356, 22801039, 24014958, 24118421, 24735507, 25429991]
avg of 6 months is 23923045
Extrapolated to 1 years: 47846090
Extrapolated to 3 years: 143538270
Extrapolated to 5 years: 239230450
Extrapolated to 10 years: 478460900

Percentage of original size: 63%

load_cassandra_commons_top_pages_per_media_file_monthly.hql

Top 1000 rows per metric:
Row counts: [34840815, 35214849, 38785148, 38879803, 40378732, 42142899]
avg of 6 months is 38373707
Extrapolated to 1 years: 76747414
Extrapolated to 3 years: 230242242
Extrapolated to 5 years: 383737070
Extrapolated to 10 years: 767474140

Top 100 rows per metric:
Row counts: [31705607, 32066656, 34723326, 34306352, 35381242, 36337743]
avg of 6 months is 34086821
Extrapolated to 1 years: 68173642
Extrapolated to 3 years: 204520926
Extrapolated to 5 years: 340868210
Extrapolated to 10 years: 681736420

Percentage of original size: 89%

load_cassandra_commons_top_viewed_categories_monthly.hql

Top 1000 rows per metric:
Row counts: [488551, 490853, 503195, 499155, 512470, 522686]
avg of 6 months is 502818
Extrapolated to 1 years: 1005636
Extrapolated to 3 years: 3016908
Extrapolated to 5 years: 5028180
Extrapolated to 10 years: 10056360

Top 100 rows per metric:
Row counts: [112858, 112765, 115443, 114384, 116975, 118395]
avg of 6 months is 115136
Extrapolated to 1 years: 230272
Extrapolated to 3 years: 690816
Extrapolated to 5 years: 1151360
Extrapolated to 10 years: 2302720

Percentage of original size: 23%

load_cassandra_commons_top_viewed_media_files_monthly.hql

Top 1000 rows per metric:
Row counts: [31602943, 32007472, 33760839, 33350496, 34096329, 34792779]
avg of 6 months is 33268476
Extrapolated to 1 years: 66536952
Extrapolated to 3 years: 199610856
Extrapolated to 5 years: 332684760
Extrapolated to 10 years: 665369520

Top 100 rows per metric:
Row counts: [21145918, 21492724, 22515085, 22434407, 22864253, 23289161]
avg of 6 months is 22290258
Extrapolated to 1 years: 44580516
Extrapolated to 3 years: 133741548
Extrapolated to 5 years: 222902580
Extrapolated to 10 years: 445805160

Percentage of original size: 67%

load_cassandra_commons_top_wikis_per_category_monthly.hql

Top 1000 rows per metric:
Row counts: [4565435, 4656221, 4891394, 4867686, 4972388, 5083295]
avg of 6 months is 4839403
Extrapolated to 1 years: 9678806
Extrapolated to 3 years: 29036418
Extrapolated to 5 years: 48394030
Extrapolated to 10 years: 96788060

Top 100 rows per metric:
Row counts: [4411351, 4501125, 4727575, 4703233, 4798316, 4897946]
avg of 6 months is 4673257
Extrapolated to 1 years: 9346514
Extrapolated to 3 years: 28039542
Extrapolated to 5 years: 46732570
Extrapolated to 10 years: 93465140

Percentage of original size: 97%

load_cassandra_commons_top_wikis_per_media_file_monthly.hql

Top 1000 rows per metric:
Row counts: [12023300, 12166315, 13219980, 12895879, 13283237, 13565730]
avg of 6 months is 12859073
Extrapolated to 1 years: 25718146
Extrapolated to 3 years: 77154438
Extrapolated to 5 years: 128590730
Extrapolated to 10 years: 257181460

Top 100 rows per metric:
Row counts: [12002961, 12145673, 13198348, 12871148, 13256429, 13532457]
avg of 6 months is 12834502
Extrapolated to 1 years: 25669004
Extrapolated to 3 years: 77007012
Extrapolated to 5 years: 128345020
Extrapolated to 10 years: 256690040

Percentage of original size: 100%

Analysis:

  • The tables that tended to benefit greatly from the reduction were also the ones with the fewest rows to begin with.
  • Over 5 years we'd have a total of 1,280,477,670 rows vs 982,058,460, yielding ~77% of the original rows.

Conclusions: We are better off deleting data than reducing from top 1000 to top 100; however, even if we did nothing else, this change would still save ~300M rows over 5 years. So I will go ahead with the change to tame the row explosion a bit.

Change #1037167 had a related patch set uploaded (by Xcollazo; author: Xcollazo):

[analytics/refinery@master] Commons Impact Metrics: Reduce the top calculations from top 1000 to top 100.

https://gerrit.wikimedia.org/r/1037167

Change #1037167 merged by Xcollazo:

[analytics/refinery@master] Commons Impact Metrics: Reduce the top calculations from top 1000 to top 100.

https://gerrit.wikimedia.org/r/1037167

This comment was removed by xcollazo.

@Eevans regarding us having a deletion mechanism:

Going back to our running example:

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki, year, month), rank)
);

Since we have the original tables in the Datalake, we can calculate the MIN() and MAX() of all the columns that are part of the PRIMARY KEY:

column            MIN()     MAX()
category_scope    deep      shallow
wiki              aawiki    zzwiki
year              2024      2024
month             0         12

So, if we put together a DELETE statement like the following:

DELETE
FROM commons.top_viewed_categories_monthly
 WHERE category_scope >= 'deep'
   AND category_scope <= 'shallow'
   AND wiki >= 'aawiki'
   AND wiki <= 'zzwiki'
   AND year >= 2024
   AND year <= 2024
   AND month >= 0
   AND month <= 12

Would this make Cassandra happy, and hopefully be fast?

No, this won't work. Everything inside those inner parens (e.g. (category_scope, wiki, year, month)) is combined to form the partition key. You can't do anything but a by-value lookup of that partition key, which in turn requires specifying each constituent value (i.e. no ranges).
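With this key layout, the only delete Cassandra will accept names every partition key component exactly, which removes one whole partition (all ranks for a given scope/wiki/month) at a time; a sketch with hypothetical values:

-- Valid against the original key: each partition key component given by value.
-- A retention job would have to enumerate every such partition it wants gone.
DELETE FROM commons.top_viewed_categories_monthly
 WHERE category_scope = 'deep'
   AND wiki = 'enwiki'
   AND year = 2024
   AND month = 3;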

Ok, so we'd have 1800 partitions (distributed over 24 nodes/dc) that are 5 * 12 * 1000 rows "wide"? 60000 per partition? That seems Ok.

An important element of this is that your queries all qualify year & month, so even if the partition runs wide and growth is unbounded, your queries aren't/don't.


I guess the real question is: If we assume we can do it this way (a range DELETE instead of a TTL), what would be the strategy for implementing retention based on this? If your plan is to retain data for 5 years, then for all intents and purposes, nothing needs to happen until 5 years from now. At 5 years and 1 month, you'd want some process to kick off and remove the corresponding month's worth of data, for each of those 1800 partitions (setting aside other tables for the time being). Presumably that's some purpose-built piece of code. Do you write that now and test it 5 years from now? Create a ticket post-dated 5 years in the future? @JAllemandou is wise, perhaps they already considered all of this before suggesting a TTL. 😀

Or, maybe the existing tables that need to be culled provide the impetus to create the infrastructure for culling older records?

Going back to this comment: It seems like it'd be OK (at least in this particular example you provided numbers for), to move year & month out of the partition key (so that they'd be indexed for range queries). I say 'OK' only because the distribution isn't as good (we'd have fewer, wider partitions), but the dimensions are still within reason. If it provided the means to implement more flexible retention, I'd call that a win. That's really the only concern I think, implementing that retention.

The way the schema is right now, the dataset can grow unbounded, yes, but other than the storage resources used, it presents no concern. If we changed it and moved year and month out of the partition key, then partitions would grow unbounded. It would probably take many years (many more than 5), but on a long enough timeline, that will eventually cause problems. So my (relatively minor) concern would be kicking the can so far down the road that nothing is ever done, and that one day our successors look back on us with contempt. 😀 In other words, if we do this, we should make a reasonable effort to actually implement the deletes. I think we have some more immediate motivation to do so too: @JAllemandou has indicated a desire to cull records from pageviews, and the schema there already permits this. Anything we created could (nay should) be generic enough to accommodate all of these.

Fair enough.

I think from the point of view of this task though, we are in agreement that:

  • TTLs are not the best mechanism to deal with this.
  • There is no immediate concern, even considering the row numbers shared on T364583#9843977.

Thus, I am closing this task, and opening a separate one to deal with a more generic solution to DELETE data from Cassandra.

Opened T366631.

Reopening to consider changing PRIMARY KEYs on the top_* tables, given @Eevans' comment on T366631:

Just for posterity sake:

Some of the Commons Impact Metrics tables already accommodate doing range deletes. Those that do not could be made to do so by moving year and month attributes into the composite key. This would make for fewer —wider— partitions, but not unacceptably so (at least based on the example discussed in T364583). Using range deletes is cheap, because you're writing a single tombstone to remove an entire range of values. As far as I can tell, this change to schema could be made without having to update any of the affected code (i.e. it changes nothing, except for how Cassandra organizes the data). It is not a change that can be made once data is in place though. In other words, now seems like an ideal time; Doing this later would require migration.

Point deletes will work too, though I suspect the tooling will end up being more bespoke, than something range based. It's also less efficient/elegant; It means a doubling of the number of transactions (however many are inserted in a day, would be the number of deletes to remove everything 5 years + 1 days old).

Here are the discrete cardinalities of the columns that are part of the PRIMARY KEYs of the top_* tables from Commons Impact Metrics:

cardinality(category) = 749294, will grow
cardinality(category_scope) = 2
cardinality(wiki) = ~900, will grow slowly
cardinality(year) = ~5
cardinality(month) = 12
cardinality(rank) = 100
cardinality(media_file) = 5217583, will grow
cardinality(edit_type) = 3

Here is a table with the suggested PRIMARY KEY changes and their respective full PRIMARY KEY cardinalities:

commons.top_pages_per_category_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, wiki, year, month) = 749294*2*900*5*12 = 80,923,752,000
    New PRIMARY KEY cardinality: cardinality(category, category_scope, wiki) = 749294*2*900 = 1,348,729,200

commons.top_wikis_per_category_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, year, month) = 749294*2*5*12 = 89,915,280
    New PRIMARY KEY cardinality: cardinality(category, category_scope) = 749294*2 = 1,498,588

commons.top_viewed_categories_monthly
    Old PRIMARY KEY cardinality: cardinality(category_scope, wiki, year, month) = 2*900*5*12 = 108,000
    New PRIMARY KEY cardinality: cardinality(category_scope, wiki) = 2*900 = 1,800

commons.top_pages_per_media_file_monthly
    Old PRIMARY KEY cardinality: cardinality(media_file, wiki, year, month) = 5217583*900*5*12 = 281,749,482,000
    New PRIMARY KEY cardinality: cardinality(media_file, wiki) = 5217583*900 = 4,695,824,700

commons.top_wikis_per_media_file_monthly
    Old PRIMARY KEY cardinality: cardinality(media_file, year, month) = 5217583*5*12 = 313,054,980
    New PRIMARY KEY cardinality: cardinality(media_file) = 5217583

commons.top_viewed_media_files_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, wiki, year, month) = 749294*2*900*5*12 = 80,923,752,000
    New PRIMARY KEY cardinality: cardinality(category, category_scope, wiki) = 749294*2*900 = 1,348,729,200

commons.top_edited_categories_monthly
    Old PRIMARY KEY cardinality: cardinality(category_scope, edit_type, year, month) = 2*3*5*12 = 360
    New PRIMARY KEY cardinality: cardinality(category_scope, edit_type) = 2*3 = 6

commons.top_editors_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, edit_type, year, month) = 749294*2*3*5*12 = 269,745,840
    New PRIMARY KEY cardinality: cardinality(category, category_scope, edit_type) = 749294*2*3 = 4,495,764

Here is how the new DDL would look for all these top_* tables:

CREATE TABLE commons.top_pages_per_category_monthly (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    page_wiki           VARCHAR,
    page_title          VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, wiki), year, month, rank)
);

CREATE TABLE commons.top_wikis_per_category_monthly (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope), year, month, rank)
);

CREATE TABLE commons.top_viewed_categories_monthly (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    category            VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki), year, month, rank)
);

CREATE TABLE commons.top_pages_per_media_file_monthly (
    media_file          VARCHAR,
    wiki                VARCHAR,
    page_wiki           VARCHAR,
    page_title          VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((media_file, wiki), year, month, rank)
);

CREATE TABLE commons.top_wikis_per_media_file_monthly (
    media_file          VARCHAR,
    wiki                VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((media_file), year, month, rank)
);

CREATE TABLE commons.top_viewed_media_files_monthly (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    media_file          VARCHAR,
    pageview_count      BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, wiki), year, month, rank)
);

CREATE TABLE commons.top_edited_categories_monthly (
    category_scope      VARCHAR,
    edit_type           VARCHAR,
    category            VARCHAR,
    edit_count          BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, edit_type), year, month, rank)
);

CREATE TABLE commons.top_editors_monthly (
    category            VARCHAR,
    category_scope      VARCHAR,
    edit_type           VARCHAR,
    user_name           VARCHAR,
    edit_count          BIGINT,
    rank                INT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, edit_type), year, month, rank)
);
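With these layouts in place, a retention pass boils down to one small range tombstone per partition; a sketch with hypothetical values (here, one statement per media_file):

-- Hypothetical retention delete against the new layout: the partition key is
-- named exactly, and old months are trimmed via the year clustering column.
DELETE FROM commons.top_wikis_per_media_file_monthly
 WHERE media_file = 'Example_file.jpg'
   AND year < 2020;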

@Eevans please take a look at the cardinality numbers on T364583#9864290, and if you think they look ok, let's go for it and change them now rather than later, as you suggest.

Could you update that table to include the number of columns per partition, in each?

I think you meant rows? In that case, we have about 6,000 rows per partition:

commons.top_pages_per_category_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, wiki, year, month) = 749294*2*900*5*12 = 80,923,752,000
    New PRIMARY KEY cardinality: cardinality(category, category_scope, wiki) = 749294*2*900 = 1,348,729,200
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_wikis_per_category_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, year, month) = 749294*2*5*12 = 89,915,280
    New PRIMARY KEY cardinality: cardinality(category, category_scope) = 749294*2 = 1,498,588
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_viewed_categories_monthly
    Old PRIMARY KEY cardinality: cardinality(category_scope, wiki, year, month) = 2*900*5*12 = 108,000
    New PRIMARY KEY cardinality: cardinality(category_scope, wiki) = 2*900 = 1,800
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_pages_per_media_file_monthly
    Old PRIMARY KEY cardinality: cardinality(media_file, wiki, year, month) = 5217583*900*5*12 = 281,749,482,000
    New PRIMARY KEY cardinality: cardinality(media_file, wiki) = 5217583*900 = 4,695,824,700
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_wikis_per_media_file_monthly
    Old PRIMARY KEY cardinality: cardinality(media_file, year, month) = 5217583*5*12 = 313,054,980
    New PRIMARY KEY cardinality: cardinality(media_file) = 5217583
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_viewed_media_files_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, wiki, year, month) = 749294*2*900*5*12 = 80,923,752,000
    New PRIMARY KEY cardinality: cardinality(category, category_scope, wiki) = 749294*2*900 = 1,348,729,200
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_edited_categories_monthly
    Old PRIMARY KEY cardinality: cardinality(category_scope, edit_type, year, month) = 2*3*5*12 = 360
    New PRIMARY KEY cardinality: cardinality(category_scope, edit_type) = 2*3 = 6
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

commons.top_editors_monthly
    Old PRIMARY KEY cardinality: cardinality(category, category_scope, edit_type, year, month) = 749294*2*3*5*12 = 269,745,840
    New PRIMARY KEY cardinality: cardinality(category, category_scope, edit_type) = 749294*2*3 = 4,495,764
    New CLUSTERING cardinality: cardinality(year, month, rank) = 5 * 12 * 100 = 6,000

These are totally fine, we should definitely do this.

Are you OK to have the existing tables removed (along with any data you've already loaded), and recreated? For both clusters, aqs and cassandra-dev?

Yes and yes.

Ok, the schema has been updated on both clusters (any data already loaded will need to be reloaded).

xcollazo renamed this task from "Consider whether we want to set a TTL on the Cassandra tables" to "Consider what mechanism to use to make date deletion possible". Thu, Jun 6, 4:24 PM
xcollazo updated the task description. (Show Details)

I'm having trouble testing data loading on the Cassandra Staging cluster. Looks like we need to redefine aqsloader permissions on the re-created tables:

...
	Suppressed: java.io.IOException: Failed to write statements to commons.top_edited_categories_monthly. The
latest exception was
  User aqsloader has no MODIFY permission on <table commons.top_edited_categories_monthly> or any of its parents

Please check the executor logs for more exceptions and information
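Presumably the fix is a grant along these lines (a sketch; the exact statement the admins run may differ):

-- Hypothetical: re-grant write permission to the loader role after the tables
-- were dropped and recreated (a keyspace-level grant also covers new tables).
GRANT MODIFY ON KEYSPACE commons TO aqsloader;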

Done.

Confirmed data loading works fine with the new schema. Thank you @Eevans for all the help and improvements!

Thank you @Eevans and @xcollazo for handling this with such precision! I'm sorry I've been out these last 2 weeks.