Page MenuHomePhabricator

Create new cassandra table data model for PCS
Closed, ResolvedPublic

Description

Background Information

After a lot of exploration we decided that PCS will manage its own cache outside of RESTBase unblock the sunset. For the short-term the restbase Cassandra cluster is considered a safe solution in terms of available resources and maintained infrastructure.

This work is required to remove the mobile-html and summary endpoints from RESTBase.

What

We need to model a new table for PCS cache that is consistent with the enforced standards and is approved by Data-Persistence.

How

Schema (see: r968717)
// Storage of various representations of MediaWiki content that must
// be pre-generated on page change.
CREATE KEYSPACE pregenerated_cache
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}; // NTS & 3-per DC in production

// PCS (née mobileapps)
CREATE TABLE pregenerated_cache.media_list (
    project  text,
    key      text,
    headers  map<text,text>,
    cached   timestamp,
    value    blob,
    PRIMARY KEY ((project, key))
) WITH comment = 'Data about media items appearing on a given page';

CREATE TABLE pregenerated_cache.mobile_html (
    project  text,
    key      text,
    headers  map<text,text>,
    cached   timestamp,
    value    blob,
    PRIMARY KEY ((project, key))
) WITH comment = 'Page content optimized for mobile applications';

CREATE TABLE pregenerated_cache.page_summary (
    project  text,
    key      text,
    headers  map<text,text>,
    cached   timestamp,
    value    blob,
    PRIMARY KEY ((project, key))
) WITH comment = 'Page content summary';

Acceptance Criteria

  • A new data model is available for PCS to store mobile-html and summary endpoint caches.

Event Timeline

MSantos triaged this task as High priority.Oct 16 2023, 2:33 PM
MSantos moved this task from Needs Triage to Backlog on the Page Content Service board.
MSantos moved this task from Unsorted to PCS Service Pile on the RESTBase Sunsetting board.
MSantos updated the task description. (Show Details)

From a quick investigation i am thinking, why would we want a full fledged schema and instead of just a k/v? This way we can migrate easily to other systems (memcached or redis) in case we want to move to a caching strategy instead of full pregeneration (cc @Eevans)

From a quick investigation i am thinking, why would we want a full fledged schema and instead of just a k/v? This way we can migrate easily to other systems (memcached or redis) in case we want to move to a caching strategy instead of full pregeneration (cc @Eevans)

The current schema looks like:

CREATE TABLE keyspace.table_name (
    "_domain" text,
    key text,
    headers text,
    tid timeuuid,
    value blob,
    PRIMARY KEY (("_domain", key))
)

The primary key is a compound value made up of the project domain and a page title, headers (which should have used type map<string>, but I digress...) stores the HTTP headers to be used when answering requests, and value is the cached content. The tid attribute is a type 1 (time-based) UUID, it's globally unique with a temporal component that can be extracted for purposes of a timestamp. I'm not sure why a UUID was used here —probably another RESTBase-ism— but having a timestamp that corresponds with when the content was cached certainly seems useful. So a "corrected" version of the above might look like:

CREATE TABLE keyspace.table_name (
    project  text,
    key      text,
    headers  map<text>,
    cached   timestamp,
    value    blob,
    PRIMARY KEY ((project, key))
)

I'm not sure this represents the full schema either, for example: I wanted to ask whether or not it made sense to be versioning the output. If the cached output/format/encoding ever needed to change, having a means of informing the client (without the requirement of deserializing) seems like it might be useful. It might also facilitate auditing storage, maintenance jobs, etc.

At any rate, the schema here is clearly more than opaque key/value. Nothing we do in storage will make that schema go away; There is no such thing as schema-less. If it isn't defined in the database, then it exists implicitly in the code, or as an external implementation (validation, constraints, and marshaling/unmarshaling of values). The former is brittle and buggy, and the latter is reinventing mechanisms the database gives us for free.

Finally, no one has asked for this (and I'm not proposing it at this time), but if we ever had need, the above leaves open the possibility of indexing headers, or the cached-on timestamp, (or a content version), so that queries could be made against them. Again, that could provide useful down the road, but isn't a possibility if those attributes are opaquely encoded into the value blob.

Since we are moving from a generic storage solution to a bespoke interim version do we need the text column for project? My initial thought was more something like:

CREATE TABLE <project>.<env> (
    id text PRIMARY KEY,
    cached timestamp,
    value blob  // response body
)

What I am still thinking is if there is an actual use of keeping the headers in our storage. In case we actually need it we can serialize both body/headers somehow in the value blob. Thoughts?

Personally I really believe its going to be useful in the future for us to be able to flip a switch and use the same schema for different storage backends (or a combination of those based on endpoint needs) in the future (with or without pregeneration) and by having a very minimal (hence my initial question for simple k/v) would allow us to use redis/memcached too.

Since we are moving from a generic storage solution to a bespoke interim version do we need the text column for project? My initial thought was more something like:

CREATE TABLE <project>.<env> (
    id text PRIMARY KEY,
    cached timestamp,
    value blob  // response body
)

You still have to namespace by project though, don't you? How else would you disambiguate es.wikipedia.org/wiki/Barack_Obama from en.wikipedia.org/wiki/Barack_Obama? If the idea is to do this in code by concatenating them around some delimiter (ala eswiki:Barack_Obama), then I'd ask why? Why invent your own key encoding when the database can do this for you?

What I am still thinking is if there is an actual use of keeping the headers in our storage. In case we actually need it we can serialize both body/headers somehow in the value blob. Thoughts?

If you are storing headers, then they are part of the schema. If you are eliding that schema in the database, then you will have to perform all the same validation, application of constraints, (de)serialization, etc elsewhere. This mirrors what I said about the project and page title above (they too are part of the schema), moving any of this out of the database doesn't save you from having to deal with it, and re-implementing wheels isn't something we should be doing.

Personally I really believe its going to be useful in the future for us to be able to flip a switch and use the same schema for different storage backends (or a combination of those based on endpoint needs) in the future (with or without pregeneration) and by having a very minimal (hence my initial question for simple k/v) would allow us to use redis/memcached too.

There is a lot to unpack here in this last sentence, but I'll try... :)

I think that over-generalizing early is something we (as engineers) are prone to do (I've been guilty of this too many times to count). I think that at the WMF though it's endemic —in fact, it's exactly the reason that RESTBase failed, and that we find ourselves here. It's probably worse here because we're so resource constrained that we're desperately looking for force-multipliers. And history (at the WMF) tells me that even if we see this as interim, once it's in place and working it could be years before we revisit it (because we're resource constrained, and are forced to be selective). As a result, I'd rather focus on solving the problem in front of us in a robust and maintainable way, even while we look ahead at the what-ifs.

That said: I would argue that if pregeneration is a requirement, then this is probably the best way to do it. If pregeneration isn't required, I'm not sure why it wouldn't be enough to rely on our HTTP caching infrastructure.

And regarding generic vs bespoke: Whatever schema we use here can be reused for anything that matches the same requirements (which I assume would be most/all examples of persistently cached alternative outputs of mediawiki content). We'd use a different table each time, but a client-side storage abstraction could be reused. And if the day ever comes that you do want to store to Redis or Memcached, then an alternative implementation of that abstraction could be made. The downside being that you'd have to spend the extra time then, that you're saving now, implementing all the schema validation, etc.

Thanks for your feedback. At this point I agree that priority is to unblock RESTBase sunset and we are currently blocked on the persistency layer so lets try to be as close as possible with the restbase model with the corrections you suggested.
Let us know how we can move forward. I guess some next steps would be:

  • Initialize tables
  • Setup users
  • Allow network flow from PCS kubernetes pods to cassandra
  • Setup listeners for the node service to connect to DB.

Because of cassandra's replication architecture we wont need different environments between codfw/eqiad right? Should we have a seperate storage for staging? PCS is currectly deployed on k8s with a staging deployment (eqiad) and 2 production deployments (codfw, eqiad).

Thanks for your feedback. At this point I agree that priority is to unblock RESTBase sunset and we are currently blocked on the persistency layer so lets try to be as close as possible with the restbase model with the corrections you suggested.

Ok, so to be clear: We need two tables, one corresponding to the existing mobile-html table, and one for page_summary, is that correct?

And both of these come from the same service (PCS), is that correct?

My current thinking is to create a new keyspace for persistently cached, pre-generated mediawiki content. Ideally we'd have a single user —with appropriate table grants— for each service (to avoid a situation where one mis-configured service might inadvertently trample the data of another).

Let us know how we can move forward. I guess some next steps would be:

  • Initialize tables
  • Setup users
  • Allow network flow from PCS kubernetes pods to cassandra
  • Setup listeners for the node service to connect to DB.

Because of cassandra's replication architecture we wont need different environments between codfw/eqiad right?

Correct.

Should we have a seperate storage for staging?

We have a Cassandra dev cluster in codfw, that's what we use (from eqiad) for staging the echo and session storage services. I can create these tables there as well. That said: It's not a formal staging environment —it's primary purpose is test/dev of Cassandra— so treat it as Best Effort™ for now.

PCS is currectly deployed on k8s with a staging deployment (eqiad) and 2 production deployments (codfw, eqiad).

That's not a problem; Cassandra should already be accessible from the k8s cluster. We'll need to fix your deployment chart to source the credentials as secrets. I've never done that, but perhaps @hnowlan (who has), can help?

Ok, so to be clear: We need two tables, one corresponding to the existing mobile-html table, and one for page_summary, is that correct?

And both of these come from the same service (PCS), is that correct?

If we break it down to each different functionality then we have:

  • media-list
  • mobile-html
  • summary

All of them are backed from the same service (mobileapps).

That's not a problem; Cassandra should already be accessible from the k8s cluster. We'll need to fix your deployment chart to source the credentials as secrets. I've never done that, but perhaps @hnowlan (who has), can help?

For sure, just point me at a chart/config and I'll sort it out.

Change 968717 had a related patch set uploaded (by Eevans; author: Eevans):

[mediawiki/services/mobileapps@master] Add cassandra schema

https://gerrit.wikimedia.org/r/968717

Ok, we need to formalize/document this better on the Data-Persistence side of things, but best-practice should be to keep the schema with the implementing code; Changes to the schema should move in lock-step with the code changes that necessitated them. To that end, I've opened: r968717.

Remaining steps:

  • Get r968717 reviewed
  • Create the schema in production & dev
  • Create the role, grants, and corresponding credentials
  • Update the deployment charts to source the credentials
  • Establish a protocol for loading the new tables (since we probably do not have the space to store this twice)

The last two items probably warrant opening separate tickets for.


image.png (834×1 px, 163 KB)

[ ... ]

Remaining steps:

  • Get r968717 reviewed
  • Create the schema in production & dev
  • Create the role, grants, and corresponding credentials
  • Update the deployment charts to source the credentials
  • Establish a protocol for loading the new tables (since we probably do not have the space to store this twice)

The last two items probably warrant opening separate tickets for.

[ ... ]

Ok, r968717 still needs to be merged, it has passed review, but is currently blocked by an unrelated build failure.

In the meantime, I have created the schema on the restbase cluster (production), and on the cassandra-dev cluster (staging).

Change 970848 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra: add grants for new mobileapps tables

https://gerrit.wikimedia.org/r/970848

Change 970848 merged by Eevans:

[operations/puppet@production] cassandra: add grants for new mobileapps tables

https://gerrit.wikimedia.org/r/970848

Change 971269 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra: add grants for new mobileapps tables (redux)

https://gerrit.wikimedia.org/r/971269

Change 971269 merged by Eevans:

[operations/puppet@production] cassandra: add grants for new mobileapps tables (redux)

https://gerrit.wikimedia.org/r/971269

Change 971278 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra_dev: add mediawiki_services_mobileapps role & grants

https://gerrit.wikimedia.org/r/971278

Change 971278 merged by Eevans:

[operations/puppet@production] cassandra_dev: add mediawiki_services_mobileapps role & grants

https://gerrit.wikimedia.org/r/971278

[ ... ]

Remaining steps:

  • Get r968717 reviewed
  • Create the schema in production & dev
  • Create the role, grants, and corresponding credentials
  • Update the deployment charts to source the credentials
  • Establish a protocol for loading the new tables (since we probably do not have the space to store this twice)

The last two items probably warrant opening separate tickets for.

[ ... ]

A new role has been created: mediawiki_services_mobileapps. It has been granted SELECT and MODIFY to all three tables. A password has been added to the private Git repo (your deployment charts will have access to it).

Change 971504 had a related patch set uploaded (by Eevans; author: Eevans):

[labs/private@master] cassandra: password for mediawiki_services_mobileapps role

https://gerrit.wikimedia.org/r/971504

[ ... ]

Remaining steps:

  • Get r968717 reviewed
  • Create the schema in production & dev
  • Create the role, grants, and corresponding credentials
  • Update the deployment charts to source the credentials
  • Establish a protocol for loading the new tables (since we probably do not have the space to store this twice)

The last two items probably warrant opening separate tickets for.

[ ... ]

I created T350507 for tracking the work needed for the k8s deployment chart, and I updated T348995 to indicate a need for a migration strategy to work within capacity constraints.

We should probably leave this open until r968717 has been merged, but I think that it is otherwise complete.

Change 971504 merged by Eevans:

[labs/private@master] cassandra: password for mediawiki_services_mobileapps role

https://gerrit.wikimedia.org/r/971504

Change 968717 merged by jenkins-bot:

[mediawiki/services/mobileapps@master] Add cassandra schema

https://gerrit.wikimedia.org/r/968717