The Machine-Learning-Team is looking to cache the output of their article topic model (for all articles) in order to meet the scale and throughput requirements of the Year in Review project (see: T401778).
Proposal
A machine learning model generates a mapping of 64 article topic predictions to their corresponding probability scores:
[
{"topic":"Culture.Media.Media*","score":0.6859594583511353},
{"topic":"Culture.Biography.Biography*","score":0.5544804334640503},
{"topic":"Culture.Literature","score":0.5156299471855164},
...
]
Clients need the ability to retrieve these topic predictions for a given page where the score meets or exceeds a supplied threshold, and caching is necessary to meet performance expectations. To serve these cached predictions, an HTTP service (one that conforms to standard HTTP caching semantics) will be implemented. Changeprop will be utilized to capture change events and create, update, or delete cache entries via the service as necessary.
Fig. 1: Solution diagram
Service
Request arguments:
| Argument | Type | Description |
| page_title | string | Wikipedia page title. |
| page_id | string | Wikipedia page id. |
| lang | string | Language of the Wiki. |
| (Optional) threshold | float | Minimum confidence threshold for prediction(s). Defaults to 0.5. |
| (Optional) debug | boolean | Debug flag used by ML team, sets threshold to 0 to see all predictions. Defaults to False. |
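The threshold and debug semantics described above can be sketched as follows (function and names are illustrative, not the actual service code):

```python
# Hypothetical sketch of the threshold-filtering semantics; the real
# service implementation may differ.

def filter_predictions(predictions, threshold=0.5, debug=False):
    """Return predictions whose score meets or exceeds the threshold.

    The debug flag (used by the ML team) sets the threshold to 0,
    so every prediction is returned.
    """
    effective = 0.0 if debug else threshold
    return [p for p in predictions if p["score"] >= effective]

preds = [
    {"topic": "Culture.Media.Media*", "score": 0.6859594583511353},
    {"topic": "Culture.Biography.Biography*", "score": 0.5544804334640503},
    {"topic": "Culture.Literature", "score": 0.5156299471855164},
]
```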
Data model
| Column | Type | Description |
| page_id | Text | The ID of the Wikipedia page |
| wiki_id | Text | Project identifier (i.e. enwiki, frwiki, etc) |
| model_version | Text | Version identifier of the article topic model used |
| predictions | map<text, float> | Mapping from topics to their predicted probability score |
| last_updated | DateTime | Timestamp of when this cache entry was last updated |
Estimating size
We can estimate the uncompressed size of data in each row:
| Column | Estimated Size | Explanation |
| page_id | 8 bytes | Most of the ids are 8 chars long. |
| wiki_id | 6 bytes | Typical wiki ID length |
| model_version | 20 bytes | Based on our current model version name. |
| predictions | 3500 bytes | Mapping of 64 topics to their probability scores. The full mapping in json format is ~3500 chars long. |
| last_updated | 8 bytes | Datetime value |
This gives us ~3,542 bytes per row (not accounting for compression or overheads). Since our plan assumes backfilling 65 million entries, we can estimate the total size as 65,000,000 rows × 3,542 bytes ≈ 230 GB (roughly 214 GiB).
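The estimate above is simple arithmetic and can be checked directly:

```python
# Back-of-the-envelope check of the per-row and total size estimates.
# Per-column sizes come from the table above.
row_bytes = 8 + 6 + 20 + 3500 + 8   # page_id + wiki_id + model_version + predictions + last_updated
rows = 65_000_000
total_bytes = rows * row_bytes

print(row_bytes)                    # bytes per row
print(total_bytes / 1e9)            # total in GB (decimal)
print(total_bytes / 2**30)          # total in GiB (binary)
```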
Cache backfilling
We plan to backfill the cache using existing data: the Research team has developed an Airflow pipeline, running monthly, which generates topic predictions for all Wikipedia articles and saves the results to a Hive table.
We want to use the latest results to backfill our cache. To do this, we plan to create an ETL Airflow pipeline that loads the existing data from the Hive table, transforms it to match the cache schema, and inserts it into the cache in a batch-processing fashion. This task is being tracked in https://phabricator.wikimedia.org/project/view/1901/.
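The transform step of the ETL could look roughly like the sketch below. All field names here are hypothetical; the real mapping depends on the actual Hive table layout:

```python
from datetime import datetime, timezone

# Hypothetical transform from one Hive result row to the cache schema.
# Column names ("wiki_db", "predictions", etc.) are assumptions for
# illustration; the real Hive schema may differ.
def to_cache_row(hive_row, model_version):
    return {
        "wiki_id": hive_row["wiki_db"],
        "page_id": int(hive_row["page_id"]),
        "model_version": model_version,
        "predictions": {p["topic"]: p["score"] for p in hive_row["predictions"]},
        "last_updated": datetime.now(timezone.utc),
    }

row = to_cache_row(
    {"wiki_db": "enwiki", "page_id": "12345",
     "predictions": [{"topic": "Culture.Literature", "score": 0.52}]},
    model_version="articletopic-v1",  # assumed version string
)
```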
Performance
Since our plan assumes a 100% cache hit ratio, we need a strong guarantee of read performance to sustain the load during the Year in Review project. We propose the following targets:
- P50 Read latency: <5ms
- P95 Read latency: <10ms
- P99 Read latency: <100ms
- Read throughput: 1000 queries per second (peak/surge during YiR season)
- Write throughput: 100 queries per second
Ownership
Machine Learning Team
Contact points: @BWojtowicz-WMF , @isarantopoulos
Expiration
31-12-2027
Decision Brief
Edit: We are revising the approach here to make use of an emerging paved pathway, tentatively called Linked Artifact Cache (gdoc proposal text here). The Linked Artifact Cache service (maintained by Data-Persistence) will persist the output of processes (derived data) that correspond to the content of pages (as article topics do). The cache service calls out to so-called lambdas to handle cache misses; the Machine-Learning-Team will implement a lambda service that returns the expected output (see above), which the artifact cache will store (and serve) verbatim. As with the original design, change events will be utilized to update the cache, but in this revised approach by issuing an HTTP request to the cache service (using the updated revision). The initial backfill will be accomplished by iterating over the corpus and likewise triggering cache misses with HTTP requests.
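A minimal sketch of the lambda contract as described (everything here is an assumption; the Linked Artifact Cache interface is still being defined):

```python
# Hypothetical lambda handler for the Linked Artifact Cache; the actual
# interface is still being defined, so names and payload shapes are
# assumptions for illustration only.

def topic_lambda(wiki_id, page_id, revision_id, score_fn):
    """On a cache miss, compute topic predictions for the given revision.

    score_fn stands in for the article topic model; the cache stores
    and serves the returned payload verbatim.
    """
    predictions = score_fn(wiki_id, page_id, revision_id)
    return {
        "model_version": "articletopic-v1",  # assumed version string
        "predictions": predictions,
    }

# Stubbed model call for illustration:
payload = topic_lambda(
    "enwiki", 12345, 67890,
    score_fn=lambda w, p, r: [{"topic": "Culture.Literature", "score": 0.52}],
)
```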
For storage, we've settled on Cassandra, provisioned on the RESTBase cluster (where similar changeprop-updated persistent caching use cases reside). Earlier discussions considered the Data Gateway (which is currently only available for the AQS/Generated Datasets cluster), but since the service will need to connect directly to the database for creates, updates, and deletes, it makes sense that it perform reads the same way (regardless of which cluster we deploy to). We propose the following schema:
CREATE TABLE IF NOT EXISTS ml_cache.topics (
    wiki_id       text,
    page_id       bigint,
    model_version text,
    predictions   map<text, float>,
    last_updated  timestamp,
    PRIMARY KEY ((wiki_id, page_id), model_version)
);
Since lookups will always include the model_version attribute, one possibility was to make it part of the partition key. We've opted instead to make it a clustering column, which preserves the ability to perform range deletes in the future (e.g. DELETE FROM ... WHERE model_version < ?). A clustering column like this will cause the partition to grow each time a new version is used, but new versions are expected to be added very infrequently, and old versions needn't be retained for long.
On the use of wiki v. lang
Earlier versions of the proposal used two-character language codes in storage, since that corresponds to the argument passed to the service (the assumption being that this is a Wikipedia-specific feature). After some discussion this was changed to use wiki_id, with values corresponding to those in dblists. The service will still accept two-letter language codes, but will map them to wiki_id when querying the database. This was done to better conform to the conventions used elsewhere, and to preserve the ability to extend the cache to non-Wikipedia projects in the future.
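For common Wikipedia codes the mapping could be as simple as the sketch below, though real values must come from dblists and edge cases (e.g. hyphenated codes) need proper normalization:

```python
# Sketch of mapping a service "lang" argument to a wiki_id for the
# database query. Authoritative values live in dblists; the hyphen
# handling here is an assumption for illustration.
def lang_to_wiki_id(lang):
    return lang.replace("-", "_") + "wiki"
```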
See: T402984#11193553 (and follow-ups)
On the storage of/keying by page titles
Users refer to articles by their page title; this is the natural key. MediaWiki, however, uses a surrogate key (a monotonically increasing integer), since page titles can change (or be reused). Storing the title is therefore not a reference to the object but a duplication of it, and should be avoided for the sake of correctness. The alternative is to use MediaWiki's API to look up the page ID for a given title; MediaWiki is authoritative for this relationship, so it makes sense to request it from MediaWiki. There is some concern that this won't be performant, but the team has agreed to try this approach first.
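The title-to-ID lookup would use MediaWiki's action API (action=query with titles=...). The parsing below runs against a canned response for illustration; an actual lookup issues an HTTP GET:

```python
import json

# Resolve a page title to its page ID via MediaWiki's action API, which is
# authoritative for the title -> page_id relationship. A real lookup would
# GET, e.g.:
#   https://en.wikipedia.org/w/api.php?action=query&titles=<title>&format=json&formatversion=2

def page_id_from_response(response):
    """Extract the page ID from a formatversion=2 action=query response."""
    pages = response["query"]["pages"]
    if not pages or "missing" in pages[0]:
        return None  # title does not exist
    return pages[0]["pageid"]

# Canned response, shaped as the API returns it for an existing page:
sample = json.loads(
    '{"query": {"pages": [{"pageid": 12345, "ns": 0, "title": "Example"}]}}'
)
```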
See also: T401778: Evaluate adding caching mechanism for article topic model to make data available at scale
