
Data Persistence Design Review: Article topic model caching
Open, Medium, Public

Description

The Machine-Learning-Team is looking to cache the output of their article topic model (for all articles) in order to meet the scale and throughput requirements for Year in Review project (see: T401778).


Proposal

A machine learning model generates predictions for 64 article topics, mapping each topic to a probability score:

[
  {"topic":"Culture.Media.Media*","score":0.6859594583511353},
  {"topic":"Culture.Biography.Biography*","score":0.5544804334640503}, 
  {"topic":"Culture.Literature","score":0.5156299471855164},
  ...
]

Clients need the ability to retrieve these topic predictions for a given page, where the score meets or exceeds a supplied threshold, and caching is necessary to meet performance expectations. To serve these cached predictions, an HTTP service (one that conforms to standard HTTP caching semantics) will be implemented. Changeprop will be used to capture change events and create, update, or delete cache entries (via the service) as necessary.
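For illustration, the threshold filtering described above can be sketched in a few lines of Python (a minimal sketch; the function name and data shapes are assumptions, not the actual service code):

```python
# Minimal sketch of the threshold filtering described above.
# The model output is a list of {"topic": ..., "score": ...} entries;
# the service returns only entries whose score meets the threshold.

def filter_predictions(predictions, threshold=0.5):
    """Return predictions whose score is >= threshold (default 0.5)."""
    return [p for p in predictions if p["score"] >= threshold]

predictions = [
    {"topic": "Culture.Media.Media*", "score": 0.6859594583511353},
    {"topic": "Culture.Biography.Biography*", "score": 0.5544804334640503},
    {"topic": "Culture.Literature", "score": 0.5156299471855164},
]

# With the default threshold of 0.5 all three entries pass;
# with threshold=0.6 only the first remains.
```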

image.png (1×3 px, 717 KB)
Fig. 1: Solution diagram
Service

Request arguments:

  • page_title (string): Wikipedia page title.
  • page_id (string): Wikipedia page ID.
  • lang (string): Language of the wiki.
  • (Optional) threshold (float): Minimum confidence threshold for prediction(s). Defaults to 0.5.
  • (Optional) debug (boolean): Debug flag used by the ML team; sets the threshold to 0 to show all predictions. Defaults to False.

NOTE: Requests may pass either page_id or page_title. The Machine Learning and Apps teams agreed that for the needs of the Year in Review project, all requests will use page_id.
Data model

Column        | Type             | Description
page_id       | Bigint           | The ID of the Wikipedia page
wiki_id       | Text             | Project identifier (e.g. enwiki, frwiki, etc.)
model_version | Text             | Version identifier of the article topic model used
predictions   | map<text, float> | Mapping from topics to their predicted probability score
last_updated  | DateTime         | Timestamp of when this cache entry was last updated

NOTE: The model_version attribute indicates the model that was used to generate the prediction (e.g. model_version=alloutlinks_202209). Since it can change and influences the predictions, we need to track it as well.
Estimating size

We can estimate the uncompressed size of data in each row:

Column        | Estimated Size | Explanation
page_id       | 8 bytes        | Most of the IDs are 8 chars long.
wiki_id       | 6 bytes        | Typical wiki ID length.
model_version | 20 bytes       | Based on our current model version name.
predictions   | 3500 bytes     | Mapping of 64 topics to their probability scores; the full mapping in JSON format is ~3500 chars long.
last_updated  | 8 bytes        | Datetime value.

This gives us ~3,542 bytes per row (not accounting for compression or overhead). Since our plan assumes backfilling 65 million entries, we can estimate the data size as 65,000,000 rows × 3,542 bytes ≈ 230 GB (~214 GiB).
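A quick sanity check of the arithmetic (column sizes taken from the table above):

```python
# Sanity check of the per-row and total size estimates above.
# Column sizes: page_id + wiki_id + model_version + predictions + last_updated
row_bytes = 8 + 6 + 20 + 3500 + 8
rows = 65_000_000

total_bytes = rows * row_bytes   # 230,230,000,000 bytes, i.e. ~230 GB (decimal)
total_gib = total_bytes / 2**30  # bytes -> GiB

print(row_bytes)                 # 3542
print(round(total_gib, 1))       # ~214.4 GiB
```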

Cache backfilling

We plan to backfill the cache using existing data: the Research team maintains a monthly Airflow pipeline that generates topic predictions for all Wikipedia articles and saves the results to a Hive table.
We want to use the latest results to backfill our cache. To do this, we plan to create an ETL Airflow pipeline that loads the existing data from the Hive table, transforms it to match the cache schema, and inserts it into the cache in batch-processing fashion. This task is being tracked in https://phabricator.wikimedia.org/project/view/1901/.

Performance

Since our plan assumes 100% cache hit ratio, we need a strong guarantee for the read performance to sustain the load during the Year in Review project. We propose the following targets:

  • P50 Read latency: <5ms
  • P95 Read latency: <10ms
  • P99 Read latency: <100ms
  • Read throughput: 1000 queries per second (peak/surge during YiR season)
  • Write throughput: 100 queries per second
Ownership

Machine Learning Team
Contact points: @BWojtowicz-WMF , @isarantopoulos

Expiration

31-12-2027


Decision Brief

Edit: We are revising the approach here to make use of an emerging paved pathway, tentatively called the Linked Artifact Cache (gdoc proposal text here). The linked artifact cache service (maintained by Data-Persistence) will persist the output of processes (derived data) that corresponds to the content of pages (as article topics do). The cache service calls out to so-called lambdas to handle cache misses; the Machine-Learning-Team will implement a lambda service that returns the expected output (see above), which the artifact cache will store (and serve) verbatim. As with the original design, change events will be used to update the cache, but in this revised approach by issuing an HTTP request to the cache service (using the updated revision). The initial backfill will be accomplished by iterating over the corpus and likewise triggering cache misses with HTTP requests.

For storage, we've settled on Cassandra, provisioned on the RESTBase cluster (where similar changeprop-updated persistent caching use-cases reside). Earlier discussions considered the use of the Data Gateway (which is currently only available for the AQS/Generated Datasets cluster), but since the service will need to connect directly to the database for creates/updates/deletes, it makes sense for it to perform reads the same way (regardless of which cluster we deploy to). We propose the following schema:

CREATE TABLE IF NOT EXISTS ml_cache.topics (
    wiki_id       text,
    page_id       bigint,
    model_version text,
    predictions   map<text, float>,
    last_updated  timestamp,
    PRIMARY KEY((wiki_id, page_id), model_version)
);

Since lookups will always include the model_version attribute, one possibility was to make it part of the partition key. We've opted instead to make the version a clustering column, which preserves the ability to conduct range deletes in the future (à la DELETE FROM...WHERE model_version < ?). A clustering column like this will cause the partition to grow each time a new version is used, but new versions are expected to be added very infrequently, and old versions needn't be retained for long.
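For illustration, a point read and the version-range cleanup against this schema might look like the following (the wiki_id, page_id, and model_version values are hypothetical):

```sql
-- Point read: full partition key (wiki_id, page_id) plus the
-- model_version clustering column
SELECT predictions, last_updated
  FROM ml_cache.topics
 WHERE wiki_id = 'enwiki' AND page_id = 123
   AND model_version = 'alloutlinks_202209';

-- Range delete of superseded model versions within a partition
DELETE FROM ml_cache.topics
 WHERE wiki_id = 'enwiki' AND page_id = 123
   AND model_version < 'alloutlinks_202209';
```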

On the use of wiki v. lang

Earlier versions of the proposal used two-character language codes in storage, since that corresponds to the argument passed to the service (the assumption being that this is a Wikipedia-specific feature). This was changed after some discussion to use wiki_id, with values corresponding to those in the dblists. The service will still accept two-letter language codes, but will map them to wiki_id when querying the database. This was done to better conform to the conventions used elsewhere, and to preserve the future ability to extend the cache to non-WP projects.

See: T402984#11193553 (and follow-ups)
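The translation layer can be sketched as follows (illustrative only; the real mapping is a static table derived from the dblists, and the function name is an assumption):

```python
# Illustrative sketch of the lang -> wiki_id translation described above.
# For Wikipedias, the wiki_id is conventionally the language code + "wiki";
# the real service uses a static mapping derived from the dblists.
LANG_TO_WIKI_ID = {
    "en": "enwiki",
    "fr": "frwiki",
    "es": "eswiki",
    # ... one entry per Wikipedia language
}

def lang_to_wiki_id(lang: str) -> str:
    """Map an API language code to the wiki_id used in storage."""
    try:
        return LANG_TO_WIKI_ID[lang]
    except KeyError:
        raise ValueError(f"unsupported language code: {lang}")
```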

On the storage of/keying by page titles

Users refer to articles by their page title; this is the natural key. MediaWiki, however, uses a surrogate key (a monotonically increasing integer), since page titles can change (or be reused). Therefore, storing the title isn't a reference to the object; it is a duplication, and should be avoided for the sake of correctness. The alternative is to use MediaWiki's API to look up the page ID for a given title; MediaWiki is authoritative for this relationship, so it makes sense to request it from there. There is some concern that this won't be performant, but the team has agreed to try this approach first.

💡In the event that title duplication becomes unavoidable, we should strongly consider implementing an event stream-updated centralized cache instead
💡Another option for services that need to query by-id from external storage systems would be to implement the service endpoints in MediaWiki, where the title-to-id mapping could be done directly.
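The title-to-ID lookup via the MediaWiki Action API (action=query with a titles parameter, formatversion=2) returns a structure like the one below. A minimal parsing sketch follows; the response shown is an abbreviated, illustrative example, and the page ID in it is made up:

```python
# Sketch: extracting a page ID from a MediaWiki Action API response
# (action=query&titles=...&format=json&formatversion=2). The sample
# response is abbreviated and the page ID is fabricated for illustration.
sample_response = {
    "query": {
        "pages": [
            {"pageid": 12345, "ns": 0, "title": "Example article"},
        ]
    }
}

def page_id_from_response(response: dict, title: str):
    """Return the pageid for `title`, or None if the page is missing."""
    for page in response.get("query", {}).get("pages", []):
        if page.get("title") == title and "pageid" in page:
            return page["pageid"]
    return None
```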

See also: T401778: Evaluate adding caching mechanism for article topic model to make data available at scale

Event Timeline

PEPE1234.13 renamed this task from Data Persistence Design Review: Article topic model caching to Del caching.Sep 6 2025, 1:28 PM
PEPE1234.13 closed this task as Invalid.
PEPE1234.13 triaged this task as Unbreak Now! priority.
Aklapper renamed this task from Del caching to Data Persistence Design Review: Article topic model caching.Sep 6 2025, 2:15 PM
Aklapper reopened this task as Open.
Aklapper lowered the priority of this task from Unbreak Now! to Needs Triage.

Why do we need a Cache

The Machine Learning Team decided to add a cache mechanism to our article topic model in order to meet the scale and throughput requirements for the Year in Review project. An extensive description of the task and previous discussions on the cache design can be found here: https://phabricator.wikimedia.org/T401778.

Solution Diagram

image.png (1×3 px, 717 KB)

Considerations for the Table Schema

Data received in the request
Users can send us the following arguments:

  • page_title: str - Wikipedia page title.
  • page_id: str - Wikipedia page id.
  • lang: str - Language of the Wiki.
  • (Optional) threshold: float - Minimum confidence threshold for prediction(s). Defaults to 0.5.
  • (Optional) debug: bool - Debug flag used by ML team, sets threshold to 0 to see all predictions. Defaults to False.

Users can pass either page_id or page_title. The Machine Learning and Apps teams agreed that for the needs of the Year in Review project, all requests will use page_id.

Data defined by the deployment
Each deployment currently contains the model_version environment variable, which indicates the model that was used to generate the prediction e.g. model_version=alloutlinks_202209. Since it can change and influence the predictions, we also need to include this information in the schema.

Data the model generates
Our model generates predictions: a mapping from 64 topics to their probability scores, which looks like this:

[
  {"topic":"Culture.Media.Media*","score":0.6859594583511353},
  {"topic":"Culture.Biography.Biography*","score":0.5544804334640503}, 
  {"topic":"Culture.Literature","score":0.5156299471855164},
  ...
]

Our application returns to the user all topics where score >= threshold.

Table Schema

I'm suggesting a composite primary key consisting of three columns: page_id, lang, model_version. This key uniquely identifies the topic predictions for a page and allows for efficient point queries.

Column        | Type             | Key Type      | Description
page_id       | Text             | Partition Key | The ID of the Wikipedia page
lang          | Text             | Partition Key | Language code for the page (e.g., 'en', 'fr', 'es')
model_version | Text             | Partition Key | Version identifier of the article topic model used
predictions   | map<text, float> | -             | Mapping from topics to their predicted probability score
last_updated  | DateTime         | -             | Timestamp of when this cache entry was last updated

Estimating DB size

We can estimate the uncompressed size of data in each row:

Column        | Estimated Size | Explanation
page_id       | 8 bytes        | Most of the IDs are 8 chars long.
lang          | 2 bytes        | Language codes are 2 letters long.
model_version | 20 bytes       | Based on our current model version name.
predictions   | 3500 bytes     | Mapping of 64 topics to their probability scores; the full mapping in JSON format is ~3500 chars long.
last_updated  | 8 bytes        | Datetime value.

This gives us ~3,538 bytes per row without any compression. Since our plan assumes backfilling 65 million entries, we can estimate the data size as 65,000,000 rows × 3,538 bytes ≈ 230 GB (~214 GiB). On top of this, Cassandra adds a small storage overhead for things like partitioning/clustering metadata, indexes, and SSTables. This overhead should not be bigger than ~100 bytes per row, which would make the database roughly 3% bigger, putting the uncompressed data size at about 236 GB (~220 GiB).

Backfilling cache

We plan to backfill the cache using existing data: the Research team maintains a monthly Airflow pipeline that generates topic predictions for all Wikipedia articles and saves the results to a Hive table.
We want to use the latest results to backfill our cache. To do this, we plan to create an ETL Airflow pipeline that loads the existing data from the Hive table, transforms it to match the cache schema, and inserts it into the cache in batch-processing fashion. This task is being tracked in https://phabricator.wikimedia.org/project/view/1901/.

Cache Invalidation

As shown in the diagram, our model listens to the Changeprop event stream, which means we get a request on each Wikipedia page change event. We plan to take advantage of this and use the requests originating from Changeprop to invalidate/update entries.
This means that for each page change event, we will generate a new prediction mapping and update the cache row. If we receive a page deletion event, we will remove the cache row for the associated page.
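The invalidation logic above can be sketched as follows, with a plain dict standing in for the Cassandra table. The event field names and handler are illustrative assumptions, not the actual changeprop integration:

```python
# Sketch of the cache invalidation logic described above. A dict keyed by
# (wiki_id, page_id) stands in for the Cassandra table; event field names
# are illustrative, not the real changeprop event schema.
cache = {}

def handle_event(event, predict):
    key = (event["wiki_id"], event["page_id"])
    if event["type"] == "page_delete":
        cache.pop(key, None)         # page deleted: drop the cache row
    else:
        cache[key] = predict(event)  # page changed: recompute and overwrite

# A change event writes a fresh prediction mapping...
handle_event({"type": "page_change", "wiki_id": "enwiki", "page_id": 123},
             predict=lambda e: {"Culture.Literature": 0.52})
# ...and a deletion event removes the row again.
handle_event({"type": "page_delete", "wiki_id": "enwiki", "page_id": 123},
             predict=None)
```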

Performance Expectations

Since our plan assumes 100% cache hit ratio, we need a strong guarantee for the read performance to sustain the load during the Year in Review project.
I'm suggesting the following targets:

  • P50 Read latency: <5ms
  • P95 Read latency: <10ms
  • P99 Read latency: <100ms
  • Read throughput: 1000 queries per second
  • Write throughput: 100 queries per second

Nice!

lang Text Partition Key Language code for the page (e.g., 'en', 'fr', 'es')

Suggestion to standardize wiki differentiation on wiki_id, rather than lang for the data field. (Your API/UI param can do whatever you need :) )

https://wikitech.wikimedia.org/wiki/Data_Platform/Data_modeling_guidelines#Wiki_vs._wiki_id_vs._wiki_db_vs._project

Suggestion to standardize wiki differentiation on wiki_id, rather than lang for the data field. (Your API/UI param can do whatever you need :) )

https://wikitech.wikimedia.org/wiki/Data_Platform/Data_modeling_guidelines#Wiki_vs._wiki_id_vs._wiki_db_vs._project

@Ottomata I assume that in this case we'd have to use the full wiki database as it appears in the dblists. So that would be enwiki instead of en. Was that what you were implying, or am I jumping to conclusions?

So that would be enwiki instead of en

For your data storage, yes! For your API/UI parameters, whatever you think is best.

You might cache different model outputs in the future, and some might work for other multilanguage wikis too. Perhaps you have a model that works for both wikipedia and for wikitionary. You'd want to store data that helps you uniquely id a page in a wiki. This would be (wiki_id, page_id), e.g. ("enwiki", 123) or ("enwiktionary", 456).

I think there are a lot of commonalities between page prediction model caching, structured task storage, and other 'wiki page entity derived data'. I have a hunch that we could standardize and streamline a lot of the process of storing, serving and maintaining these kinds of datasets. Having a common data model for keys for this kind of data will be helpful when doing that. It also means that you don't have to reason about which wiki an 'en' page belongs to :)

use the full wiki database as it appears in the dblists

Ya. FWIW wiki_id does not strictly == database name. It just happens to here at Wikimedia Foundation deployed wikis.

https://www.mediawiki.org/wiki/Manual:Wiki_ID

@BWojtowicz-WMF let's adopt the wiki_id for the data field and we can continue to use lang to avoid altering the api parameters.
@Eevans Is there anything else required from the ML team for the design review? Is there an estimate for when this can be delivered, so that we can plan the appropriate integration with the service and the necessary backfill? Thanks!

Why do we need a Cache

The Machine Learning Team decided to add a cache mechanism to our article topic model in order to meet the scale and throughput requirements for the Year in Review project. An extensive description of the task and previous discussions on the cache design can be found here: https://phabricator.wikimedia.org/T401778.

[ ... ]

Table Schema

I'm suggesting a composite primary key consisting of three columns: page_id, lang, model_version. This key uniquely identifies the topic predictions for a page and allows for efficient point queries.

Column        | Type             | Key Type      | Description
page_id       | Text             | Partition Key | The ID of the Wikipedia page
lang          | Text             | Partition Key | Language code for the page (e.g., 'en', 'fr', 'es')
model_version | Text             | Partition Key | Version identifier of the article topic model used
predictions   | map<text, float> | -             | Mapping from topics to their predicted probability score
last_updated  | DateTime         | -             | Timestamp of when this cache entry was last updated

I suggest the following:

CREATE TABLE IF NOT EXISTS ml_cache.topics (
    wiki_id       text,
    page_id       bigint,
    model_version text,
    predictions   map<text, float>,
    last_updated  timestamp,
    PRIMARY KEY((wiki_id, page_id), model_version)
);

This uses wiki_id in place of lang (see earlier comments), uses bigint for page_id, and (wiki_id, page_id) as the partition key, with model_version as a composite.

page_id is an integer, so better to have the schema reflect that (and for the database to apply that as constraint).

I'm suggesting to use a composite for model_version here so that it's possible to clean up past versions later with a DELETE FROM...WHERE model_version < {some_version}. It queries the same way (both inserts & selects). The partition will grow each time you store predictions under a new version, but I'm given to understand that happens very infrequently? It would take a lot of versions for this to become a problem.

Does this make sense?

[ ... ]

Backfilling cache

We plan to backfill the Cache using existing data - research team developed an Airflow pipeline running monthly, which generates topic predictions for all Wikipedia articles and saves the results to Hive table.
We want to use the latest results to backfill our Cache. To do this, we plan to create an ETL Airflow pipeline, which will load the existing data from Hive table, transform it to match the Cache schema and insert it into Cache in batch-processing fashion. This task is being tracked in https://phabricator.wikimedia.org/project/view/1901/.

Cache Invalidation

As shown in the diagram, our model listens to the Changeprop event stream, which means we get a request on each Wikipedia page change event. We plan to take advantage of this and use the requests originating from Changeprop to invalidate/update entries.
This means that for each page change event, we will generate a new prediction mapping and update the cache row. If we receive a page deletion event, we will remove the cache row for the associated page.

Something to keep in mind: CQL semantics allow you to replace the entire contents of a map, append to it, or (over)write individual elements (see: https://cassandra.apache.org/doc/5.0/cassandra/developing/cql/cql_singlefile.html#map). I assume you'll be connecting directly to the database and executing queries for the changeprop updates, and using the spark loader for the backfilling? If so, the former shouldn't be a problem, but the latter might require some testing (I'm not sure how writes to a map work from that loader).
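The three map-write forms mentioned look like this in CQL (the key values and scores are illustrative):

```sql
-- Replace the entire map (what a changeprop-driven update would do)
UPDATE ml_cache.topics
   SET predictions = {'Culture.Literature': 0.52},
       last_updated = toTimestamp(now())
 WHERE wiki_id = 'enwiki' AND page_id = 123
   AND model_version = 'alloutlinks_202209';

-- Append/merge entries into the existing map
UPDATE ml_cache.topics
   SET predictions = predictions + {'Culture.Media.Media*': 0.69}
 WHERE wiki_id = 'enwiki' AND page_id = 123
   AND model_version = 'alloutlinks_202209';

-- (Over)write a single element
UPDATE ml_cache.topics
   SET predictions['Culture.Literature'] = 0.60
 WHERE wiki_id = 'enwiki' AND page_id = 123
   AND model_version = 'alloutlinks_202209';
```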

Performance Expectations

Since our plan assumes 100% cache hit ratio, we need a strong guarantee for the read performance to sustain the load during the Year in Review project.
I'm suggesting the following targets:

  • P50 Read latency: <5ms
  • P95 Read latency: <10ms
  • P99 Read latency: <100ms
  • Read throughput: 1000 queries per second
  • Write throughput: 100 queries per second

I think these are all quite reasonable. 1000 qps, though, might require us to scale up the Data Gateway!

@Ottomata @isarantopoulos
Thank you for the suggestion and discussion about using the wiki_id. The article model does not currently work for other wikis, but I very much like the idea of standardizing our DB schemas across different models to use page_id and wiki_id for indices.
To avoid altering the current API parameters to the model, which expects a lang parameter, I've created a static lang->wiki_id mapping for each Wikipedia language; it will be used internally by our application code to translate between lang and wiki_id when interacting with the cache.


@Eevans

I suggest the following:

CREATE TABLE IF NOT EXISTS ml_cache.topics (

wiki_id       text,
page_id       bigint,
model_version text,
predictions   map<text, float>,
last_updated  timestamp,
PRIMARY KEY((wiki_id, page_id), model_version)

);
This uses wiki_id in place of lang (see earlier comments), uses bigint for page_id, and (wiki_id, page_id) as the partition key, with model_version as a composite.

page_id is an integer, so better to have the schema reflect that (and for the database to apply that as constraint).

I'm suggesting to use a composite for model_version here so that it's possible to clean up past versions later with a DELETE FROM...WHERE model_version < {some_version}. It queries the same way (both inserts & selects). The partition will grow each time you store predictions under a new version, but I'm given to understand that happens very infrequently? It would take a lot of versions for this to become a problem.

Does this make sense?

Thank you for suggesting the schema, your proposal sounds very good to me!

The model_version does indeed change very infrequently; currently we still use a model trained in 2022. The frequency could increase in the future, as the ML Team is working on Airflow pipelines for automatic retraining of our models, but AFAIK such a pipeline is not yet planned for the article topic model. So I think we're good to use model_version in the composite key!

Something to keep in mind: CQL semantics allow you to replace the entire contents of a map, append to it, or (over)write individual elements (see: https://cassandra.apache.org/doc/5.0/cassandra/developing/cql/cql_singlefile.html#map). I assume you'll be connecting directly to the database and executing queries for the changeprop updates, and using the spark loader for the backfilling? If so, the former shouldn't be a problem, but the latter might require some testing (I'm not sure how writes to a map work from that loader).

Thanks for the note on CQL! I think in our case we'd be overwriting the entire predictions map each time we update the entry.

Our plan is as you described it: we plan to execute queries for the changeprop updates by connecting to the DB directly.
For the backfilling, we plan to take a similar approach to the one @Ottomata shared here using Spark: https://phabricator.wikimedia.org/T403254#11163973. I agree that it will definitely require some testing on our side.

I think these are all quite reasonable. 1000 qps, though, might require us to scale up the Data Gateway!

This performance expectation was linked directly to the Year in Review project, where we expected to process a few hundred queries per second, so 1000 QPS would be a safe choice with some error margin. However, it was recently decided that the article topic model will not be used in the upcoming Year in Review, but possibly in next year's edition. Thus, we will not be hitting 1000 QPS anytime soon, but such a load would be possible in the future.

[ ... ]

I think these are all quite reasonable. 1000 qps, though, might require us to scale up the Data Gateway!

This performance expectation was linked directly to the Year in Review project, where we expected to process a few hundred queries per second, so 1000 QPS would be a safe choice with some error margin. However, it was recently decided that the article topic model will not be used in the upcoming Year in Review, but possibly in next year's edition. Thus, we will not be hitting 1000 QPS anytime soon, but such a load would be possible in the future.

Oh, that's right! I'd forgotten we expected a short-term surge for YiR. This is definitely the sort of thing we'd want to capture and have a plan for (or at least awareness of).


On a somewhat related note: I'm bouncing around the idea that perhaps your use-case is a better fit for the RESTBase cluster (RESTBase, like AQS, is a misnomer here; both are multi-tenant clusters). The AQS cluster is (or at least has been) geared more toward materialized representations, analytics, etc. The things persisting data there mostly follow an ETL pattern (even though we've talked about using event streams, and a more Lambda-like architecture). Most of what is there is time-series, or versioned, where data is written but not updated. The RESTBase cluster has primarily been for caching (and a bit of application state). Primarily caching alternate representations of content, but caching nonetheless. Those caches have been maintained by changeprop jobs, jobs that hit a service with a no-cache header, which then writes through to Cassandra... which sounds familiar?

We talked about using the Data Gateway, but it does not serve the RESTBase cluster. It could, if that were useful, but it's something we'd have to do (either by wiring the existing DGW service to the other cluster, or using a separately deployed gateway). But, is the Data Gateway even useful here? I mean, you'll already be managing a connection to Cassandra from the ArticleTopic Entrypoint for making writes, it's probably easiest to just query directly for reads as well (regardless of which cluster we use).

Were you still thinking you'd use the Data Gateway, or querying directly for writes & reads? And if the former, would doing the latter instead be a problem?

On a somewhat related note: I'm bouncing around the idea that perhaps your use-case is a better fit for the RESTBase cluster (RESTBase, like AQS, is a misnomer here; both are multi-tenant clusters). The AQS cluster is (or at least has been) geared more toward materialized representations, analytics, etc. The things persisting data there mostly follow an ETL pattern (even though we've talked about using event streams, and a more Lambda-like architecture). Most of what is there is time-series, or versioned, where data is written but not updated. The RESTBase cluster has primarily been for caching (and a bit of application state). Primarily caching alternate representations of content, but caching nonetheless. Those caches have been maintained by changeprop jobs, jobs that hit a service with a no-cache header, which then writes through to Cassandra... which sounds familiar?

I totally agree that for our model cache, the characteristics of the RESTBase cluster sound much more fitting than the AQS cluster.

We talked about using the Data Gateway, but it does not serve the RESTBase cluster. It could, if that were useful, but it's something we'd have to do (either by wiring the existing DGW service to the other cluster, or using a separately deployed gateway). But, is the Data Gateway even useful here? I mean, you'll already be managing a connection to Cassandra from the ArticleTopic Entrypoint for making writes, it's probably easiest to just query directly for reads as well (regardless of which cluster we use).

Were you still thinking you'd use the Data Gateway, or querying directly for writes & reads? And if the former, would doing the latter instead be a problem?

I agree that the ArticleTopic deployment wouldn't need the Data Gateway at all, as you suggested. We would use a direct connection there for both reads and writes.


There's one caveat I can see in this proposal, and in my design as well: once the cache is backfilled and "live", the data stored in it will be very valuable for analytics. The design does not assume that the cache will have users other than our ArticleTopic deployment, but I can see a lot of value in making this data accessible to either the Machine Learning Team or the Research Team. In that case, having the Data Gateway for reads would be very valuable. However, the main mode of operation would still be closer to the RESTBase cluster.

@isarantopoulos Do you think that access to the cache data for analytics is a use-case we need to take into account here as well?

Since the events that are produced (prediction data) are ingested into the Hive table event.mediawiki_page_outlink_topic_prediction_change_v1, we can utilize that for analytics purposes. The data is available there for 90 days, and we are looking to increase the retention period in T405358: Add LiftWing streams data to event_sanitized (increase data retention).
+1 on querying directly and not going through the data gateway

In this case I also agree that querying directly without Data Gateway would be the best option for us as well as deploying on RESTBase.

@Eevans I have a small curiosity question regarding RESTBase vs AQS: for our type of real-time, short-transaction processing, should we expect better performance if deployed on RESTBase? What kind of gearing toward this OLTP processing is there on RESTBase, versus toward OLAP processing on the AQS cluster?

In this case I also agree that querying directly without Data Gateway would be the best option for us as well as deploying on RESTBase.

@Eevans I have a small curiosity question regarding RESTBase vs AQS: for our type of real-time, short-transaction processing, should we expect better performance if deployed on RESTBase? What kind of gearing toward this OLTP processing is there on RESTBase, versus toward OLAP processing on the AQS cluster?

That's a great question, and honestly... the short answer is that I'm not sure I expect there to be any difference in that regard (yet?).

The somewhat longer answer (and for posterity sake): Both of those clusters started out as single-tenancy, each supporting their own service (or closely related set of services). AQS (read: The Services) were OLAP, and RESTBase (read: The Services) was more OLTP (and largely persistent caching). The AQS cluster was transitioned to being multi-tenant when it was obvious that we had interest in other use cases persisting & serving OLAP-generated datasets (and using the same tools and infrastructure too). RESTBase-the-software was sunsetted, and alternative implementations for what it had been doing were created, some of which remained on that cluster. That's how we arrived at where we are today.

The AQS cluster nodes do use a different disk configuration, one that trades extra disks (aka storage is more expensive) for (in theory) better read performance (they are RAID10). However, with improved support for JBOD in Cassandra, it's our intention to move all of the clusters to that, so that distinction will go away. The AQS cluster also has higher density (more storage per node), which makes the "blast radius" in the event of a failure greater. That's not been a problem in practice, but it probably makes a case for RESTBase being better suited to the sorts of applications deployed there(?)

In summary: There isn't much that separates the two clusters, and it's fair to question whether or not article topic cache could go on one versus the other (I can't see it creating any material difference to how your service would function). It's also fair to question whether we even need both clusters, or if one could do the job. History though has created the precedent, and so I am leaning toward preserving the distinction until we decide to do otherwise. It's definitely not hard to imagine leveraging that distinction, if not in hardware, then infrastructure, management practices, etc.

Eevans triaged this task as Medium priority.Oct 1 2025, 11:45 PM
Eevans updated the task description. (Show Details)

@Eevans
Thank you very much for elaborating on the history and differences between those two. I was curious what kind of optimizations could be done there like the RAID10 storage and higher density, it's very interesting!
I agree that even if there are no major differences, we should still deploy our Cache in the RESTBase cluster, which is meant for this type of processing.

I see you filled out the description with all the discussed details, thank you a lot! Is there anything else from our side that is needed at the moment?

[ ... ]

I see you filled out the description with all the discussed details, thank you a lot!

I'm not quite sure where I'm going with that (the description, format, etc); I'm just trying some ideas out!

Is there anything else from our side that is needed at the moment?

I've still got more to add to the description, namely a decision brief section with a summary of how we decided to do this. If you can continue to make sure that everything there is correct, and what we discussed, that would be great.

We're also trying out a couple of new items: Ownership and Expiration (see https://wikitech.wikimedia.org/wiki/SRE/Data_Persistence/Design_Review). Ownership is a team + at least two people/contacts, one of which should be a manager. Expiration is a date that represents the period of time that the listed owners are committing to maintenance, after which the data product becomes a candidate for removal. Anyone can extend that expiration at any time, no questions asked (though of course, in doing so they de facto volunteer to become the owner! 😀). Can you fill those out as well?

I have updated the ownership and expiration date.
@Eevans There has been a change of plans regarding the integration of this work with this year's Year in Review, so although we still need this Cassandra instance, the request we have filed for the improve-tone structured task in T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task is of higher priority. I just wanted to mention this so you can handle your priorities and timelines accordingly.

I have updated the ownership and expiration date.
@Eevans There has been a change of plans regarding the integration of this work with this year's Year in Review, so although we still need this Cassandra instance, the request we have filed for the improve-tone structured task in T401021: Data Persistence Design Review: Improve Tone Suggested Edits newcomer task is of higher priority. I just wanted to mention this so you can handle your priorities and timelines accordingly.

Thanks that helps; I think we're now ready to move forward with this one, but I'll shift priorities to T401021 in the meantime.