
Create Cassandra tables for Commons Impact Metrics
Open, Needs Triage, Public

Description

Part of T358707.

In this task, we want to get tables created in the production Cassandra instance so that we can start testing our data loading HQL.

Context:

  • Commons Impact Metrics is an effort to help the GLAM community see the impact of their contributions.
  • We have a set of 6 Iceberg tables that live in the Datalake.
  • These 6 tables will be transformed into 14 Cassandra tables optimized for serving the data. (Serving Layer Design)
  • A set of 14 AQS endpoints will receive HTTP requests, query Cassandra and serve the data. (API design)

As per the Serving Layer Design, we have drafted the Cassandra DDL below.

CREATE KEYSPACE commons
    WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' }; -- this needs to be changed to whatever prod settings we typically use.
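    -- (In production this would presumably be NetworkTopologyStrategy with
    -- per-datacenter replication, e.g. { 'class' : 'NetworkTopologyStrategy',
    -- 'eqiad' : 3, 'codfw' : 3 }; exact settings TBD with Data-Persistence.)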


CREATE TABLE commons.category_metrics_snapshot (
    category                   VARCHAR,
    media_file_count           BIGINT,
    media_file_count_deep      BIGINT,
    used_media_file_count      BIGINT,
    used_media_file_count_deep BIGINT,
    leveraging_wiki_count      BIGINT,
    leveraging_wiki_count_deep BIGINT,
    leveraging_page_count      BIGINT,
    leveraging_page_count_deep BIGINT,
    month                      TIMESTAMP,
    PRIMARY KEY (category, month)
);

CREATE TABLE commons.media_file_metrics_snapshot (
    media_file                 VARCHAR,
    leveraging_wiki_count      BIGINT,
    leveraging_page_count      BIGINT,
    month                      TIMESTAMP,
    PRIMARY KEY (media_file, month)
);


CREATE TABLE commons.pageviews_by_category (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    pageview_count      BIGINT,
    month               TIMESTAMP,
    PRIMARY KEY ((category, category_scope, wiki), month)
);


CREATE TABLE commons.pageviews_by_media_file (
    media_file          VARCHAR,
    wiki                VARCHAR,
    pageview_count      BIGINT,
    month               TIMESTAMP,
    PRIMARY KEY ((media_file, wiki), month)
);


CREATE TABLE commons.edits_by_category (
    category            VARCHAR,
    category_scope      VARCHAR,
    edit_type           VARCHAR,
    edit_count          BIGINT,
    month               TIMESTAMP,
    PRIMARY KEY ((category, category_scope, edit_type), month)
);


CREATE TABLE commons.edits_by_user (
    user_name           VARCHAR,
    edit_type           VARCHAR,
    edit_count          BIGINT,
    month               TIMESTAMP,
    PRIMARY KEY ((user_name, edit_type), month)
);


CREATE TABLE commons.top_pages_by_category (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    top_data            TEXT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, wiki, year, month))
);


CREATE TABLE commons.top_wikis_by_category (
    category            VARCHAR,
    category_scope      VARCHAR,
    top_data            TEXT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, year, month))
);


CREATE TABLE commons.top_viewed_categories (
    category_scope      VARCHAR,
    wiki                VARCHAR,
    top_data            TEXT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, wiki, year, month))
);


CREATE TABLE commons.top_pages_by_media_file (
    media_file          VARCHAR,
    wiki                VARCHAR,
    top_data            TEXT,
    year                INT,
    month               INT,
    PRIMARY KEY ((media_file, wiki, year, month))
);


CREATE TABLE commons.top_wikis_by_media_file (
    media_file          VARCHAR,
    top_data            TEXT,
    year                INT,
    month               INT,
    PRIMARY KEY ((media_file, year, month))
);


CREATE TABLE commons.top_viewed_media_files (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    top_data            TEXT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, wiki, year, month))
);


CREATE TABLE commons.top_edited_categories (
    category_scope      VARCHAR,
    edit_type           VARCHAR,
    top_data            TEXT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category_scope, edit_type, year, month))
);


CREATE TABLE commons.top_editors (
    category            VARCHAR,
    category_scope      VARCHAR,
    edit_type           VARCHAR,
    top_data            TEXT,
    year                INT,
    month               INT,
    PRIMARY KEY ((category, category_scope, edit_type, year, month))
);
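
The snapshot and per-month tables above are meant to be read with a point lookup on the partition key plus an optional range over the month clustering column. For example (a sketch; the literal values are placeholders):

SELECT *
FROM commons.category_metrics_snapshot
WHERE category = 'Art_museums'
  AND month >= '2024-01-01'
  AND month <  '2024-06-01';

The top_* tables put every dimension in the partition key, so each read there is a single-partition point query.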

Event Timeline

xcollazo added a subscriber: Eevans.

@Eevans I believe you are the owner of the production Cassandra instance.

Kindly please take a look at this ticket, and let us know your questions and comments.

Eevans moved this task from Next to Backlog on the Cassandra board.
Eevans moved this task from Backlog to Next on the Cassandra board.

> @Eevans I believe you are the owner of the production Cassandra instance.

Data-Persistence manages the cluster, yes!

> Kindly please take a look at this ticket, and let us know your questions and comments.

First off, thanks for such a detailed request. I realize you probably didn't work all of this up strictly for our benefit, but it's nonetheless very cool to see something so thorough and well thought out.

Let me start off with a general question: What does your timeline look like? It sounds like you want the tables created first so that you can start working on data loading, and that the service endpoints will come sometime after, correct? When do you need the tables, and when do you expect to build the services? When are you looking to launch all of this?

For the testing phase, are you OK with using the cassandra-dev environment? It's a smaller cluster (3 nodes in codfw) that we've historically used to stage Cassandra changes, but that we've more recently hoped to start using for application testing/staging as well (see: T355920). It's a safer place to experiment with, and we're in the process of setting it up to allow developer access to run cqlsh (see: T355730).

As for the more technical deets, I have a few things here as well:

First, a lot of the metric values in your proposed schema use bigint; is a 64-bit signed integer warranted here, or would a 32-bit (int) value work?

Second, a number of these tables use an attribute called top_data that is a JSON-encoded blob. Unless there is a very good reason, I would strongly discourage doing this. That serialized value is a part of the data model, a part that has been elided from this schema. As far as I can tell, each example of this could be modeled entirely in Cassandra to produce the same results. Take commons.top_pages_by_category as an example. If you change the table to this:

CREATE TABLE commons.top_pages_by_category (
    category            VARCHAR,
    category_scope      VARCHAR,
    wiki                VARCHAR,
    year                INT,
    month               INT,
    page_title          TEXT,
    page_view_count     INT,
    rank                INT,
    PRIMARY KEY ((category, category_scope, wiki, year, month), page_title)
);

The same query —using predicates for category, category_scope, wiki, year, and month— will produce the results that you are JSON encoding.
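
For concreteness, a read along these lines would return those rows directly (a sketch; the literal values are placeholders):

SELECT page_title, page_view_count, rank
FROM commons.top_pages_by_category
WHERE category = 'Art_museums'
  AND category_scope = 'deep'
  AND wiki = 'en.wikipedia'
  AND year = 2024
  AND month = 1;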

Finally, something for you to consider: For image suggestions we created something we're calling Cassandra HTTP Gateway. The idea is that if you are persisting results exactly as you hope to return them, right down to attribute names, then we could give you an HTTP interface (REST) to retrieve them from. This wouldn't be a substitute for your service, but your service could make a simple HTTP request rather than directly query Cassandra. This should make things much simpler for you. The gateway returns JSON-encoded results as an object with a single attribute called rows, which is an array of JSON-encoded row objects ({rows: [{...}, {...}, ...]}). For your case, we'd be able to set that up so that you could plug that array right into items: [] in your response objects. Let me know what you think!

> [ ... ]
>
> Finally, something for you to consider: For image suggestions we created something we're calling Cassandra HTTP Gateway. The idea is that if you are persisting results exactly as you hope to return them, right down to attribute names, then we could give you an HTTP interface (REST) to retrieve them from. This wouldn't be a substitute for your service, but your service could make a simple HTTP request rather than directly query Cassandra. This should make things much simpler for you. The gateway returns JSON-encoded results as an object with a single attribute called rows, which is an array of JSON-encoded row objects ({rows: [{...}, {...}, ...]}). For your case, we'd be able to set that up so that you could plug that array right into items: [] in your response objects. Let me know what you think!

P.S. For clarification: The HTTP Gateway service would have no bearing on how data is loaded, only on serving reads.

@Eevans thank you for the extensive comments, and apologies for taking so long to respond.

>> @Eevans I believe you are the owner of the production Cassandra instance.
>
> Data-Persistence manages the cluster, yes!
>
>> Kindly please take a look at this ticket, and let us know your questions and comments.
>
> First off, thanks for such a detailed request. I realize you probably didn't work all of this up strictly for our benefit, but it's nonetheless very cool to see something so thorough and well thought out.

Kudos to @mforns. I'm just executing the spec, but he wrote it.

> Let me start off with a general question: What does your timeline look like? It sounds like you want the tables created first so that you can start working on data loading, and that the service endpoints will come sometime after, correct?

Correct.

> When do you need the tables, and when do you expect to build the services? When are you looking to launch all of this?

We'd want to have the tables up ASAP, as work on the AQS endpoints is currently blocked by this.

We are hoping to go to prod by end of May.

> For the testing phase, are you OK with using the cassandra-dev environment? It's a smaller cluster (3 nodes in codfw) that we've historically used to stage Cassandra changes, but that we've more recently hoped to start using for application testing/staging as well (see: T355920). It's a safer place to experiment with, and we're in the process of setting it up to allow developer access to run cqlsh (see: T355730).

That would actually be ideal for me. Happy to see that we are working towards an integration -> staging -> production pipeline. I noticed though that the referenced tickets are still open. Is this something I could use right away?

> As for the more technical deets, I have a few things here as well:
>
> First, a lot of the metric values in your proposed schema use bigint; is a 64-bit signed integer warranted here, or would a 32-bit (int) value work?

I'd argue that for all the values that store pageview counts, it'd be safer to keep bigint, considering we do a bunch of rollups and pageview counts can be very big numbers. But agreed that many, like category_metrics_snapshot's media_file_count, could very well be ints. @mforns WDYT?

> Second, a number of these tables use an attribute called top_data that is a JSON-encoded blob. Unless there is a very good reason, I would strongly discourage doing this. That serialized value is a part of the data model, a part that has been elided from this schema. As far as I can tell, each example of this could be modeled entirely in Cassandra to produce the same results. Take commons.top_pages_by_category as an example. If you change the table to this:
>
> CREATE TABLE commons.top_pages_by_category (
>     category            VARCHAR,
>     category_scope      VARCHAR,
>     wiki                VARCHAR,
>     year                INT,
>     month               INT,
>     page_title          TEXT,
>     page_view_count     INT,
>     rank                INT,
>     PRIMARY KEY ((category, category_scope, wiki, year, month), page_title)
> );
>
> The same query —using predicates for category, category_scope, wiki, year, and month— will produce the results that you are JSON encoding.

I'll let @mforns expand/refute, but I think the rationale for encoding this inside the top_data was to have point queries and thus avoid range queries like the one you propose. If you believe Cassandra would be happy with range queries that could have up to 1000 rows, perhaps we should reconsider the schema indeed.
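
For context, with the original schema every read is a single-partition point query that returns one blob. A sketch with placeholder values:

SELECT top_data
FROM commons.top_pages_by_category
WHERE category = 'Art_museums'
  AND category_scope = 'deep'
  AND wiki = 'en.wikipedia'
  AND year = 2024
  AND month = 1;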

> Finally, something for you to consider: For image suggestions we created something we're calling Cassandra HTTP Gateway. The idea is that if you are persisting results exactly as you hope to return them, right down to attribute names, then we could give you an HTTP interface (REST) to retrieve them from. This wouldn't be a substitute for your service, but your service could make a simple HTTP request rather than directly query Cassandra. This should make things much simpler for you. The gateway returns JSON-encoded results as an object with a single attribute called rows, which is an array of JSON-encoded row objects ({rows: [{...}, {...}, ...]}). For your case, we'd be able to set that up so that you could plug that array right into items: [] in your response objects. Let me know what you think!

Ah man, I wish we knew about this sooner! @mforns (and @Milimetric and @SGupta-WMF), I think this goes back to our conversation on "why don't we just let AI write all this boilerplate code?". @Eevans where can we go read more about this mechanism?

Yes @xcollazo I also wish I knew about this sooner.
And thanks @Eevans for the careful review!

>> As for the more technical deets, I have a few things here as well:
>>
>> First, a lot of the metric values in your proposed schema use bigint; is a 64-bit signed integer warranted here, or would a 32-bit (int) value work?
>
> I'd argue that for all the values that store pageview counts, it'd be safer to keep bigint, considering we do a bunch of rollups and pageview counts can be very big numbers. But agreed that many, like category_metrics_snapshot's media_file_count, could very well be ints. @mforns WDYT?

Agree! Everything that's not pageviews can be INT. I believe media file counts per category are not going over 1M so far, so we'd have quite some slack there. I see some pageview counts that reach 30+B though, so we need a BIGINT there.
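
For instance, category_metrics_snapshot would then look something like this (a sketch, pending team review; none of these counts are pageviews):

CREATE TABLE commons.category_metrics_snapshot (
    category                   VARCHAR,
    media_file_count           INT,
    media_file_count_deep      INT,
    used_media_file_count      INT,
    used_media_file_count_deep INT,
    leveraging_wiki_count      INT,
    leveraging_wiki_count_deep INT,
    leveraging_page_count      INT,
    leveraging_page_count_deep INT,
    month                      TIMESTAMP,
    PRIMARY KEY (category, month)
);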

>> Second, a number of these tables use an attribute called top_data that is a JSON-encoded blob. Unless there is a very good reason, I would strongly discourage doing this. That serialized value is a part of the data model, a part that has been elided from this schema. As far as I can tell, each example of this could be modeled entirely in Cassandra to produce the same results. Take commons.top_pages_by_category as an example. If you change the table to this:
>> ...
>> The same query —using predicates for category, category_scope, wiki, year, and month— will produce the results that you are JSON encoding.
>
> I'll let @mforns expand/refute, but I think the rationale for encoding this inside the top_data was to have point queries and thus avoid range queries like the one you propose. If you believe Cassandra would be happy with range queries that could have up to 1000 rows, perhaps we should reconsider the schema indeed.

What Xabriel says! I thought that was better for Cassandra. But I can see that modeling it directly in Cassandra is the right thing to do. My bad. I think we should bring this up with the team, since it will represent some rework.

>> Finally, something for you to consider: For image suggestions we created something we're calling Cassandra HTTP Gateway. The idea is that if you are persisting results exactly as you hope to return them, right down to attribute names, then we could give you an HTTP interface (REST) to retrieve them from. This wouldn't be a substitute for your service, but your service could make a simple HTTP request rather than directly query Cassandra. This should make things much simpler for you. The gateway returns JSON-encoded results as an object with a single attribute called rows, which is an array of JSON-encoded row objects ({rows: [{...}, {...}, ...]}). For your case, we'd be able to set that up so that you could plug that array right into items: [] in your response objects. Let me know what you think!
>
> Ah man, I wish we knew about this sooner! @mforns (and @Milimetric and @SGupta-WMF), I think this goes back to our conversation on "why don't we just let AI write all this boilerplate code?". @Eevans where can we go read more about this mechanism?

Yea. I should have reached out to you @Eevans when designing the Cassandra tables 1 month ago. @Milimetric and @SGupta-WMF, what do you think? How much of a rework would that be? Would it be worth it? Let's discuss with the team as well.

> @Eevans thank you for the extensive comments, and apologies for taking so long to respond.
>
> [ ... ]
>> Second, a number of these tables use an attribute called top_data that is a JSON-encoded blob. Unless there is a very good reason, I would strongly discourage doing this. That serialized value is a part of the data model, a part that has been elided from this schema. As far as I can tell, each example of this could be modeled entirely in Cassandra to produce the same results. Take commons.top_pages_by_category as an example. If you change the table to this:
>>
>> CREATE TABLE commons.top_pages_by_category (
>>     category            VARCHAR,
>>     category_scope      VARCHAR,
>>     wiki                VARCHAR,
>>     year                INT,
>>     month               INT,
>>     page_title          TEXT,
>>     page_view_count     INT,
>>     rank                INT,
>>     PRIMARY KEY ((category, category_scope, wiki, year, month), page_title)
>> );
>>
>> The same query —using predicates for category, category_scope, wiki, year, and month— will produce the results that you are JSON encoding.

> I'll let @mforns expand/refute, but I think the rationale for encoding this inside the top_data was to have point queries and thus avoid range queries like the one you propose. If you believe Cassandra would be happy with range queries that could have up to 1000 rows, perhaps we should reconsider the schema indeed.

It's really no different from a Cassandra perspective. All of that data is going to be stored contiguously, and will need to be read from storage on every read; encoding it separately like this won't accomplish anything. If anything, the database can page through a large result set (if needed), but an encoded blob like this will need to be read in its entirety before returning to the client.

On top of all the usual reasons this is a bad idea (and a potential source of bugs), in this case it will also make things weird for the HTTP Gateway if we decide to go that route (which, the more I think about it, the more I feel we should). Instead of getting back fully encoded JSON results, you'll get JSON objects that contain an escaped string that will need to be de-serialized (another potential source of bugs).
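
To illustrate (hypothetical payloads; the inner shape of top_data isn't pinned down by the schema), a gateway response for the blob version would wrap an escaped string:

{ "rows": [ { "top_data": "{\"items\": [{\"page_title\": \"...\", \"rank\": 1}]}" } ] }

whereas the fully modeled table would come back as plain rows:

{ "rows": [ { "page_title": "...", "page_view_count": 12345, "rank": 1 } ] }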

>> Finally, something for you to consider: For image suggestions we created something we're calling Cassandra HTTP Gateway. The idea is that if you are persisting results exactly as you hope to return them, right down to attribute names, then we could give you an HTTP interface (REST) to retrieve them from. This wouldn't be a substitute for your service, but your service could make a simple HTTP request rather than directly query Cassandra. This should make things much simpler for you. The gateway returns JSON-encoded results as an object with a single attribute called rows, which is an array of JSON-encoded row objects ({rows: [{...}, {...}, ...]}). For your case, we'd be able to set that up so that you could plug that array right into items: [] in your response objects. Let me know what you think!
>
> Ah man, I wish we knew about this sooner! @mforns (and @Milimetric and @SGupta-WMF), I think this goes back to our conversation on "why don't we just let AI write all this boilerplate code?". @Eevans where can we go read more about this mechanism?

It's written about at a high-level here: https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream/Data_Gateway (but I'm not sure that it adds anything to the description above).

Basically, we'd put together the gateway service for you (running in k8s). You'd have endpoints for each of the queries you'd otherwise make to Cassandra, for example:

GET /public/commons-analytics/category-metrics-snapshot/<category>/<start>/<end>

Would return something like:

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: xxxx
Date: xxxxxxx

{
  "rows": [
    {
      "timestamp": "2024010100",
      "media_file_count": 12345,
      "media_file_count_deep": 1234567,
      "used_media_file_count": 453,
      "used_media_file_count_deep": 4536,
      "leveraging_wiki_count": 12,
      "leveraging_wiki_count_deep": 120,
      "leveraging_page_count": 47,
      "leveraging_page_count_deep": 417
    },
    {
      "timestamp": "2024020100",
      "media_file_count": 2345,
      "media_file_count_deep": 234567,
      "used_media_file_count": 53,
      "used_media_file_count_deep": 536,
      "leveraging_wiki_count": 2,
      "leveraging_wiki_count_deep": 20,
      "leveraging_page_count": 7,
      "leveraging_page_count_deep": 17
    },

    ...
  ]
}

The gateway will return one of two HTTP statuses, 200 or 500. If it's a 500, you get an HTTP problem with the details of the underlying error (which shouldn't happen unless the DB is broken).

And that's pretty much it. You can dispense with all of the code for setting up and connecting to the database, authentication, and executing queries. You'll never have to upgrade a driver. It should be much simpler.

> Yes @xcollazo I also wish I knew about this sooner.
> And thanks @Eevans for the careful review!
>
>>> As for the more technical deets, I have a few things here as well:
>>>
>>> First, a lot of the metric values in your proposed schema use bigint; is a 64-bit signed integer warranted here, or would a 32-bit (int) value work?
>>
>> I'd argue that for all the values that store pageview counts, it'd be safer to keep bigint, considering we do a bunch of rollups and pageview counts can be very big numbers. But agreed that many, like category_metrics_snapshot's media_file_count, could very well be ints. @mforns WDYT?
>
> Agree! Everything that's not pageviews can be INT. I believe media file counts per category are not going over 1M so far, so we'd have quite some slack there. I see some pageview counts that reach 30+B though, so we need a BIGINT there.

Great; can you update the schema (in the description) accordingly?

@Eevans yes, of course.
Let's freeze this task for now.
We will meet as a team, since these changes would require some rework.
We'll decide, and we'll get back to you with all the changes to this task and the design doc.
🙏

> @Eevans yes, of course.
> Let's freeze this task for now.
> We will meet as a team, since these changes would require some rework.
> We'll decide, and we'll get back to you with all the changes to this task and the design doc.
> 🙏

A few things in summary:

First, I hope this doesn't require much rework, but I think we'd need a pretty compelling reason to proceed the way it is. I know this sort of thing was done in the past —there is precedent— but it shouldn't have been, and while I understand the hour may be late (more on that below), it's not too late to fix this one.

Second, the approach taken with the HTTP Data Gateway is still considered experimental, so we should definitely try to take the time now to reflect on whether it still makes sense. However, this is the exact use-case for which we undertook that work, so not using it here would be implicitly terminating that experiment, and I see no justification for that. In other words, unless there is a compelling reason to do otherwise, we should plan on having the service interface with the Data Gateway instead of talking to Cassandra directly. As I mentioned above, this should make things much simpler for your services. To my preceding point though, that blob containing serialized JSON is going to interfere with how that's meant to work.

Finally, I think we can get all of this done pretty quickly, but a timeline that gets you to production by the end of May doesn't seem realistic, not because of the amount of work per se, but because of all the time between now and then that is already spoken for.

Let me know, and feel free to grab some time on my calendar if you'd rather discuss any of this synchronously!

Since it is experimental, I do not think we should go in the HTTP Data Gateway direction at this time.

@WDoranWMF is this interesting to think through as a future release?

> Since it is experimental, I do not think we should go in the HTTP Data Gateway direction at this time.
>
> @WDoranWMF is this interesting to think through as a future release?

TL;DR: It has (thus far) been experimental in terms of approach, not in the sense that it creates any more (or less) risk in using it. And it should not create any additional burden on the team developing the service (in fact, it will alleviate some).

See: https://www.mediawiki.org/wiki/Platform_Engineering_Team/Data_Value_Stream/Data_Gateway

The raison d'être of this work was creating platforms to streamline our work, something that is presumably still important. I was the one who insisted on the label "experimental", and I called it that at the time because N equaled 1 (Image Suggestions), and I didn't want to see us invest in anything unnecessarily elaborate without some evidence that there was more than one qualifying use-case. This project is that evidence. It fits the pattern exactly (better than Image Suggestions, really). That would make this the terminus of the experiment and —as far as I can see— it seems to have been successful. If there is a compelling reason not to use it, that would in fact indicate the experiment was a failure, and that we should no longer pursue this. I'm open to that being the case, and now is most definitely the time.