The Growth-Team and Machine-Learning-Team are working to create a GrowthExperiments newcomer task for suggesting edits to improve tone (associated hypotheses are WE1.1.2 and WE1.1.8 respectively).
Proposal
Changeprop will be used to process article change events and filter them down to a qualifying subset of articles (currently people and sports). Article text will be parsed into paragraphs and scored by the tone check machine learning model. CirrusSearch will then be updated with a weighted tag, and Cassandra with the output of the tone check model. The GrowthExperiments extension will then use CirrusSearch to obtain a list of candidate tasks, and the Data Gateway to retrieve the tone checks themselves. See Figure 1 below.
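A minimal sketch of the filtering step, assuming hypothetical event field names and topic labels (the real change-event schema and filter criteria are not specified here):

```python
# Hypothetical sketch of the change-event filter described above: keep only
# events for articles in the qualifying subset (currently people and sports).
# Field names and topic labels are illustrative, not the real event schema.

QUALIFYING_TOPICS = {"people", "sports"}  # assumption: per-article topic tags

def qualifies(event: dict) -> bool:
    """Return True if a change event should be scored by the tone model."""
    return (
        event.get("namespace") == 0                    # main/article namespace
        and not event.get("is_redirect", False)
        and bool(QUALIFYING_TOPICS & set(event.get("topics", [])))
    )

def candidate_events(events):
    """Filter a stream of change events down to the qualifying subset."""
    return [e for e in events if qualifies(e)]
```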
Fig. 1: Solution diagram
Data model
| Column | Type | Description |
|---|---|---|
| wiki_id | text | Wiki or project |
| page_id | unsigned int | Corresponds to MediaWiki's page_id |
| model_version | text | Version of the tone check model used |
| revision_id | unsigned int | Corresponds to MediaWiki's rev_id (the revision ID) |
| paragraphs | set<tuple<text, int, float>> | Paragraphs and their corresponding index, and tone check score |
Estimating size
| Column | Estimated Size | Explanation |
|---|---|---|
| wiki_id | 6 bytes | Most wiki IDs (e.g. enwiki) are ~6 characters in length |
| page_id | 8 bytes | 64-bit unsigned integer |
| model_version | 20 bytes | Estimated |
| revision_id | 8 bytes | 64-bit unsigned integer |
| paragraphs | ?? * (?? + 4 bytes + 4 bytes) | Number of paragraphs * size of paragraph text, index, and score |
This gives us ~??? bytes per row (not accounting for compression or overheads). Since we anticipate ~?? entries, we can estimate the size of data as ?? rows × ?? bytes ≈ ?? GB.
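The arithmetic above can be made concrete as a parametric estimator; the inputs below are placeholders (the real paragraph counts and sizes are still TBD, per the ?? figures above), to be filled in with measured values:

```python
# Parametric row-size estimate for the table above. Fixed-column sizes come
# from the estimation table; paragraph count and average paragraph size are
# placeholder parameters (the real values are still TBD in the text).

FIXED_BYTES = 6 + 8 + 20 + 8  # wiki_id + page_id + model_version + revision_id

def row_size_bytes(n_paragraphs: int, avg_paragraph_bytes: int) -> int:
    """Estimate one row: fixed columns + per-paragraph (text, int idx, float score)."""
    per_paragraph = avg_paragraph_bytes + 4 + 4  # text + 4-byte int + 4-byte float
    return FIXED_BYTES + n_paragraphs * per_paragraph

def dataset_size_gb(n_rows: int, n_paragraphs: int, avg_paragraph_bytes: int) -> float:
    """Total uncompressed estimate, ignoring Cassandra overheads."""
    return n_rows * row_size_bytes(n_paragraphs, avg_paragraph_bytes) / 1e9
```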
Performance
Ownership
Machine-Learning-Team
Contacts: @AikoChou, @isarantopoulos
Expiration
31-12-2026
Decision brief
We will use Cassandra for persistence, and provision storage on the Generated Data (née AQS) cluster.
```sql
CREATE TABLE ml_cache.page_paragraph_tone_scores (
    wiki_id       text,
    page_id       bigint,
    revision_id   bigint,
    model_version text,
    content       text,
    score         float,
    idx           int,
    PRIMARY KEY ((wiki_id, page_id), revision_id, model_version, idx)
);
```
Naming policy/convention
The table is named page_paragraph_tone_scores to explicitly reflect its association with MediaWiki page entities and to promote consistent naming across derived datasets. While the data records tone scores for individual paragraphs, paragraphs themselves do not have a corresponding MediaWiki entity concept and only exist in the context of specific pages (and/or revisions). Including "page" in the name makes that dependency clear, aids long-term discoverability in shared data catalogs and APIs, and aligns with existing and anticipated conventions for page-based derived data such as ML predictions or structured task outputs.
On the possible mischaracterization of the problem
A number of the issues surfaced during design seem to be the result of a mischaracterization of the problem and, perhaps, of an early bias as to what the solution ought to be. From the earliest stages of this project, we've characterized the requirements in such a way as to almost guarantee the data would be placed behind the Data Gateway in the Generated Datasets (AQS) cluster. In reality, this use-case seems much more closely aligned with the workloads we typically place in the RESTBase cluster, where bespoke microservices use Cassandra tables as what effectively amounts to a cache (durable, preemptive caching).
For the purposes of this section, we can define a cache as storage for the deterministic output of a process. For tone checking, Lift Wing is the process, and tone check output is deterministic (given a model, it will repeatedly produce the same results for a given revision). This use-case does require that we preemptively perform tone checks so that search tags can be added, and it does make sense to store the result since a) we have already sunk resources into computing it, b) it will not change unless the revision has been superseded by another, and c) it optimizes performance, but is it required? A service that handled misses and transparently recomputed and re-stored tone checks would close the loop (the potential performance implications notwithstanding).
Proper caching semantics could potentially solve the issues documented here around storage "leakage" (see: On the indiscriminate issuance of deletes below), as TTLs could be applied to create an upper bound on such leakage. The length of the TTL could be adjusted to balance the frequency/likelihood of misses with the latency penalty (i.e. typical caching tuning).
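The cache semantics described above can be sketched as follows, with an in-memory dict standing in for the Cassandra table and an injected function standing in for Lift Wing (all names here are illustrative):

```python
import time

# Sketch of the caching semantics described above: reads that miss (or whose
# entry has expired via TTL) transparently recompute and re-store the result.
# A dict stands in for the Cassandra table; compute_fn stands in for Lift Wing.

class ToneScoreCache:
    def __init__(self, compute_fn, ttl_seconds: float):
        self._compute = compute_fn      # deterministic per (revision, model)
        self._ttl = ttl_seconds         # upper bound on storage "leakage"
        self._store = {}                # key -> (expires_at, value)

    def get(self, wiki_id: str, page_id: int, revision_id: int, model: str):
        key = (wiki_id, page_id, revision_id, model)
        entry = self._store.get(key)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]                      # hit: still within TTL
        value = self._compute(*key)              # miss/expired: recompute...
        self._store[key] = (time.monotonic() + self._ttl, value)
        return value                             # ...and re-store with a fresh TTL
```

A longer TTL trades storage leakage for fewer recomputations; a shorter one bounds leakage at the cost of more misses, i.e. the typical cache tuning described above.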
It's worth noting that T402984: Data Persistence Design Review: Article topic model caching (which has been running concurrently to this task) is very similar¹, and is taking a cache strategy.
[1]: Differing mainly in the aforementioned need to set search tags.
On the use of the Generated Data/AQS cluster
Image-suggestions notwithstanding (which is an outlier in almost every respect), there is a pretty clear delineation between the workloads currently deployed on the Generated Data Platform cluster (AQS), and the RESTBase cluster (no catchy rename, yet). The former is aggregated time-series data, and the latter revision-associated content that must be updated/invalidated when the document changes (typically triggered by changeprop). Tone check seems to share more in common with what we've been deploying on the RESTBase cluster, and maintaining that distinction could be really beneficial.
For example: the update patterns are sufficiently different that we could almost certainly benefit (longer-term) from different hardware specifications (node density, I/O, etc.) and/or optimizations. Likewise, I think we are increasingly seeing the need for a paved pathway for handling revision-based content updates from event streams, and encapsulating that in a single environment might also prove useful (and simpler).
The only sticking point I see is that use-cases on the RESTBase cluster typically implement an HTTP service that conforms to HTTP cache semantics. The service handles client reads, and changeprop jobs issue requests with Cache-Control: no-cache headers to trigger updates. The Data Gateway doesn't (currently) cover the RESTBase cluster.
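The read path described above can be sketched as a single handler; the storage and recompute backends are injected stand-ins, and header keys are assumed to be lower-cased (names here are illustrative):

```python
# Sketch of the read path used by services on the RESTBase cluster: clients
# read through an HTTP handler, and changeprop triggers updates by sending
# Cache-Control: no-cache. Storage and recompute backends are stand-ins;
# header keys are assumed to be normalized to lower case.

def handle_get(headers: dict, key: str, storage: dict, recompute) -> tuple:
    """Return (value, status). Honors Cache-Control: no-cache by recomputing."""
    no_cache = "no-cache" in headers.get("cache-control", "")
    if not no_cache and key in storage:
        return storage[key], "hit"
    value = recompute(key)      # miss or forced refresh: recompute...
    storage[key] = value        # ...and persist the result
    return value, ("refresh" if no_cache else "miss")
```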
Options (in no particular order):
- Deploy to RESTBase cluster w/ a service that implements HTTP cache semantics (no DG)... out of scope for now
- Deploy to RESTBase w/ an instance of the Data Gateway
- Deploy to RESTBase w/ existing Data Gateway extended to cover both clusters
- Deploy to Generated Data Platform (AQS) cluster + Data Gateway
On the indiscriminate issuance of deletes
Take for example a scenario where the Tone Suggestion Generator (see Fig. 1) has parsed an article, applied the model, and determined that no tone check is warranted for a revision. Short of a preemptive read, there is no state that would tell us whether an earlier revision of the page had a tone issue; there is no way for us to know whether or not we need to clean up older records. Given that Cassandra is log-structured, issuing a delete (whether there is a record to delete or not) is cheaper than a read (let alone a read followed by a delete), so we've opted to issue deletes in all cases. This means that for every change event where an article passes the filter and the model is applied, a Cassandra write of some kind will be performed, even when it is not to store a tone check.
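A sketch of the resulting write path, generating CQL against the decision-brief schema (parameter binding elided; the exact scope of the delete is covered in the MVCC section):

```python
# Sketch of the "always delete" write path described above: every qualifying
# change event yields an unconditional delete (blind deletes are cheap in a
# log-structured store), and inserts only when the model produced scores.
# CQL strings follow the decision-brief schema; parameter binding is elided.

DELETE_CQL = (
    "DELETE FROM ml_cache.page_paragraph_tone_scores "
    "WHERE wiki_id = ? AND page_id = ?"
)
INSERT_CQL = (
    "INSERT INTO ml_cache.page_paragraph_tone_scores "
    "(wiki_id, page_id, revision_id, model_version, content, score, idx) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)

def statements_for_event(scored_paragraphs: list) -> list:
    """Every event yields a delete; inserts only when the model found scores."""
    stmts = [DELETE_CQL]
    stmts.extend(INSERT_CQL for _ in scored_paragraphs)
    return stmts
```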
MVCC
Tone check falls into a (very) common category of services that persist artifacts corresponding to objects in MediaWiki (in this case, articles). Additionally, the persisted artifact corresponds to the content of an article, something which can change arbitrarily from one edit to the next, so in actuality, the artifact corresponds to a revision of an article. Modelling such a service properly therefore requires modelling the revision as well.
A number of possibilities were discussed with respect to revisions, the simplest of which was to index only on page ID (and wiki), but persist the revision in order to late-filter/error-handle revision mismatches. Ultimately though, the decision was made to implement an MVCC pattern. Change events will be used to recompute tone check scores and update the dataset, which is indexed by revision. A range delete will be used to clean up earlier entries (if any). This update model guarantees correctness and is concurrency safe.
Implementing this MVCC approach is more complex though, and given how common these requirements are, Data-Engineering and Data-Persistence have committed to investigating tools and infrastructure for a paved pathway (one that keeps a future migration of tone check in mind).
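A sketch of the MVCC update against the decision-brief schema: because revision_id is the first clustering column, a range delete on revision_id < current is a valid CQL range tombstone (parameter binding elided):

```python
# Sketch of the MVCC update described above. revision_id is the first
# clustering column in the decision-brief schema, so older revisions can be
# removed with a single range delete. Parameter binding is elided.

INSERT_CQL = (
    "INSERT INTO ml_cache.page_paragraph_tone_scores "
    "(wiki_id, page_id, revision_id, model_version, content, score, idx) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)
RANGE_DELETE_CQL = (
    "DELETE FROM ml_cache.page_paragraph_tone_scores "
    "WHERE wiki_id = ? AND page_id = ? AND revision_id < ?"
)

def mvcc_update(paragraph_count: int) -> list:
    """Write the new revision's rows first, then clean up earlier revisions.

    Ordering matters for concurrency safety: a reader racing the update sees
    either the old revision's rows or the new ones, never an empty partition.
    """
    return [INSERT_CQL] * paragraph_count + [RANGE_DELETE_CQL]
```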
On dev/test environments (or the lack thereof)
TODOs/Remaining Questions
- Size estimations (see: Proposal section above)
- Performance expectations
- Ownership
- Expiration
- Determine destination cluster (see: On the use of the Generated Data/AQS cluster)
- Need some documentation on dev/test environment requirements (see: On dev/test environments (or the lack thereof))
Questions
Q: Is the article set deterministic? Not every article will be checked for tone; the design says that filtering will be applied, and that only a subset of articles will be considered. Is this guaranteed to always be the same subset? If not, how likely is it that we would persist tone checks that would subsequently be omitted by the filtering? Is there potential here to "leak" tone check storage and/or fail to remove weighted tags in search?
A: We should not expect the subset to be stable (i.e. there is the potential to leak tone scores).
