
Data Persistence Design Review: Improve Tone Suggested Edits newcomer task
Closed, Resolved · Public

Description

The Growth-Team and Machine-Learning-Team are working to create a GrowthExperiments newcomer task for suggesting edits to improve tone (associated hypotheses are WE1.1.2 and WE1.1.8 respectively).


Proposal

Changeprop will be used to process article change events and filter them down to a qualifying subset of articles (currently people and sports). Article text will be parsed into paragraphs and scored by the tone check machine learning model. CirrusSearch will then be updated with a weighted tag, and Cassandra updated with the output of the tone check model. The GrowthExperiments extension will then use CirrusSearch to obtain a list of candidate tasks, and the Data Gateway to retrieve the tone checks themselves. See Figure 1 below.

revise_tone.png (786×2 px, 130 KB)
Fig. 1 Solution diagram
Data model
Column | Type | Description
wiki_id | text | Wiki or project
page_id | unsigned int | Corresponds to MediaWiki's page_id
model_version | text | Version of the tone check model used
revision_id | unsigned int | Corresponds to MediaWiki's rev_id (the revision ID)
paragraphs | set<tuple<text, int, float>> | Paragraphs with their corresponding index and tone check score
Estimating size
TODO: Do.
Column | Estimated Size | Explanation
wiki_id | 6 bytes | Most of the ids are 8 chars in length
page_id | 8 bytes | 64-bit unsigned integer
model_version | 20 bytes | Estimated
revision_id | 8 bytes | 64-bit unsigned integer
paragraphs | ?? × (?? + 4 bytes + 4 bytes) | Number of paragraphs × size of paragraph text, index, and score

This gives us ~??? bytes per row (not accounting for compression or overheads). Since we anticipate ~?? entries, we can estimate the size of data as ?? rows × ?? bytes ≈ ?? GB.

Performance
TODO: Do.
  • P50 Read latency:
  • P95 Read latency:
  • P99 Read latency:
  • Read throughput:
  • Write throughput:
Ownership

Machine-Learning-Team
Contacts: @AikoChou, @isarantopoulos

Expiration

31-12-2026

Decision brief

We will use Cassandra for persistence, and provision storage on the Generated Data (née AQS) cluster.

CREATE TABLE ml_cache.page_paragraph_tone_scores  (
    wiki_id        text,
    page_id        bigint,
    revision_id    bigint,
    model_version  text,
    content        text,
    score          float,
    idx            int,
    PRIMARY KEY ((wiki_id, page_id), revision_id, model_version, idx)
);
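
For illustration, a minimal sketch of the expected read path against this table (hypothetical values): the full partition key plus revision_id returns every stored paragraph score for that revision, across model versions, ordered by model_version and idx.

SELECT model_version, idx, content, score
FROM ml_cache.page_paragraph_tone_scores
WHERE wiki_id = 'enwiki'
  AND page_id = 1
  AND revision_id = 12;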
Naming policy/convention

The table is named page_paragraph_tone_scores to explicitly reflect its association with MediaWiki page entities and to promote consistent naming across derived datasets. While the data records tone scores for individual paragraphs, paragraphs themselves do not have a corresponding MediaWiki entity concept and only exist in the context of specific pages (and/or revisions). Including "page" in the name makes that dependency clear, aids long-term discoverability in shared data catalogs and APIs, and aligns with existing and anticipated conventions for page-based derived data such as ML predictions or structured task outputs.

On the possible mischaracterization of the problem
This section is a work-in-progress

A number of the issues surfaced during design seem to result from a mischaracterization of the problem, and perhaps from an early bias as to what the solution ought to be. From the earliest stages of this project, we've characterized the requirements in such a way as to almost guarantee it would be placed behind the Data Gateway in the Generated Datasets (AQS) cluster. In reality, this use-case seems much more closely aligned with the workloads we typically place on the RESTBase cluster, where bespoke microservices use Cassandra tables as what effectively amounts to a cache (durable, preemptive caching).

For the purposes of this section, we can define a cache as storage for the deterministic output of a process. For tone checking, Lift Wing is the process, and tone check output is deterministic (given a model, it will repeatedly produce the same results for a given revision). This use-case does require that we preemptively perform tone checks so that search tags can be added, and it does make sense to store the result since a) we have already sunk resources into computing it, b) it will not change unless the revision has been superseded by another, and c) it optimizes performance, but is it required? A service that handled misses and transparently recomputed and re-stored tone checks would close the loop (the potential performance implications notwithstanding).

Proper caching semantics could potentially solve the issues documented here around storage "leakage" (see: On the indiscriminate issuance of deletes below), as TTLs could be applied to create an upper bound on such leakage. The length of the TTL could be adjusted to balance the frequency/likelihood of misses with the latency penalty (i.e. typical caching tuning).
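
As an illustration only (not part of the agreed design), such a per-row TTL would be attached at write time; the 90-day bound below is a hypothetical value:

INSERT INTO ml_cache.page_paragraph_tone_scores
    (wiki_id, page_id, revision_id, model_version, idx, content, score)
VALUES ('enwiki', 1, 12, 'v1', 0, 'rain in spain', 0.8)
USING TTL 7776000;  -- 90 days in seconds; the row expires unless a later edit refreshes it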

It's worth noting that T402984: Data Persistence Design Review: Article topic model caching (which has been running concurrently to this task) is very similar¹, and is taking a cache strategy.

[1]: Differing mainly in the aforementioned need to set search tags.

On the use of the Generated Data/AQS cluster
This section is a work-in-progress

Image-suggestions notwithstanding (which is an outlier in almost every respect), there is a pretty clear delineation between the workloads currently deployed on the Generated Data Platform cluster (AQS), and the RESTBase cluster (no catchy rename, yet). The former is aggregated time-series data, and the latter revision-associated content that must be updated/invalidated when the document changes (typically triggered by changeprop). Tone check seems to share more in common with what we've been deploying on the RESTBase cluster, and maintaining that distinction could be really beneficial.

For example: The update patterns are sufficiently different that we could almost certainly benefit (longer-term) from different hardware specifications (node density, I/O, etc.) and/or optimizations. Likewise, I think we are increasingly seeing the need for a paved pathway for handling revision-based content updates from event streams, and encapsulating that in a single environment might also prove useful (and simpler).

The only sticking point I see is that use-cases on the RESTBase cluster typically implement an HTTP service that conforms to HTTP cache semantics. The service handles client reads, and changeprop jobs issue requests with Cache-control: no-cache headers to trigger updates. The Data Gateway doesn't (currently) cover the RESTBase cluster.

Options (in no particular order):

  • Deploy to RESTBase cluster w/ a service that implements HTTP cache semantics (no DG) ...out of scope for now
  • Deploy to RESTBase w/ an instance of the Data Gateway
  • Deploy to RESTBase w/ existing Data Gateway extended to cover both clusters
  • Deploy to Generated Data Platform (AQS) cluster + Data Gateway
QUESTION: What is Tone Suggestion Generator (as referenced in F66750510)? Is it architecturally compatible with the sort of HTTP service mentioned above?
On the indiscriminate issuance of deletes
This section is a work-in-progress

Take for example a scenario where the Tone Suggestion Generator (see Fig. 1) has parsed an article, applied the model, and determined that no tone check is warranted for a revision. Short of a preemptive read, there is no state that would tell us whether this revision had a tone issue previously; there is no way for us to know whether or not we need to clean up older records. Given that Cassandra is log-structured, issuing a delete (whether there is a record to delete or not) is cheaper than a read (let alone a read followed by a delete), so we've opted to issue deletes in all cases. This means that for every change event where an article passes the filter and the model is applied, a Cassandra write of some kind will be performed, even when it is not to store a tone check.
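
To make the trade-off concrete, a sketch of such a blind cleanup against the decision-brief table (hypothetical values): the statement simply writes a tombstone whether or not any matching rows exist, so no preemptive read is required.

-- No tone check is warranted for the new revision, so nothing needs to be kept for this page
DELETE FROM ml_cache.page_paragraph_tone_scores
WHERE wiki_id = 'enwiki'
  AND page_id = 1;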

MVCC

Tone check falls into a (very) common category of services that persist artifacts corresponding to objects in MediaWiki (in this case, articles). Additionally, the persisted artifact corresponds to the content of an article, something which can change arbitrarily from one edit to the next, so in actuality, the artifact corresponds to a revision of an article. Modelling such a service properly therefore requires modelling the revision as well.

A number of possibilities were discussed with respect to revisions, the simplest of which was to index only on page ID (and wiki), but persist the revision in order to late-filter/error-handle revision mismatches. Ultimately though, the decision was made to implement an MVCC pattern. Change events will be used to recompute tone check scores, and update the dataset, which is indexed by revision. A range delete will be used to clean up earlier entries (if any). This update model will guarantee correctness and is concurrency safe.
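
A minimal sketch of that update sequence against the decision-brief schema (hypothetical values): rows for the new revision are written first, then a single range delete over the revision_id clustering column removes whatever remains from earlier revisions.

INSERT INTO ml_cache.page_paragraph_tone_scores
    (wiki_id, page_id, revision_id, model_version, idx, content, score)
VALUES ('enwiki', 1, 13, 'v1', 0, 'rain in spain', 0.8);

-- Range delete: cleans up entries belonging to any earlier revision
DELETE FROM ml_cache.page_paragraph_tone_scores
WHERE wiki_id = 'enwiki'
  AND page_id = 1
  AND revision_id < 13;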

Implementing this MVCC approach is more complex though, and given how common these requirements are, Data-Engineering and Data-Persistence have committed to investigating tools and infrastructure for a paved pathway (one that keeps a future migration of tone check in mind).

On dev/test environments (or the lack thereof)
This section is a work-in-progress
TODO: Do.

TODOs/Remaining Questions

  • Size estimations (see: Proposal section above)
  • Performance expectations
  • Ownership
  • Expiration
  • Determine destination cluster (see: On the use of the Generated Data/AQS cluster)
  • Need some documentation on dev/test environment requirements (see: On dev/test environments (or the lack thereof))
Questions

Q: Is the article set deterministic? Not every article will be checked for tone; The design says that filtering will be applied, and that only a subset of articles will be considered. Is this guaranteed to always be the same subset? If not, how likely is it that we would persist tone checks that would subsequently be omitted by the filtering? Is there potential here to "leak" tone check storage and/or fail to remove weighted tags in search?

A: We should not expect the subset to be stable (i.e. there is the potential to leak tone scores)

Related Objects

Event Timeline

Eevans updated the task description.

I've created a draft merge-request here: https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/9, please have a look and let me know if —among other things— there are any issues with the keyspace and table names (which by convention are exposed in the DG urls), attribute names (which by convention will be returned in JSON results), or the order/disposition of URL parameters (I've ordered them differently to how they appear in the schema). Also, let me know whether or not you think we should include all of the attributes in the results. For example, do you want wiki_id, page_id, etc., in the results, given that presumably the caller will know them (having just supplied them as query parameters)?


That said: I'd like to suggest a last minute revision to the schema itself.

The currently agreed upon schema is this...

CREATE TABLE ml_cache.tone_check  (
    wiki_id       text,
    page_id       bigint,
    model_version text,
    revision_id   bigint,
    paragraphs    set<tuple<text, int, float>>,  -- paragraph text, index, score
    PRIMARY KEY((wiki_id, page_id), model_version, revision_id)
);

...inserts for that will look like...

INSERT INTO ml_cache.tone_check
    (wiki_id, page_id, model_version, revision_id, paragraphs)
VALUES
    ('enwiki', 1, 'v1', 12, {('rain in spain', 0, 0.8), ('falls in plains', 1, 0.9)});

...and results will look like:

[
    {
        "wiki_id": "enwiki",
        "page_id": 1,
        "model_version": "v1",
        "revision_id": 12,
        "paragraphs": [["falls in plains", 1, 0.9], ["rain in spain", 0, 0.8]]
    }
]
NOTE: One row object, with a paragraphs collection, containing a sequence of paragraph tuples

What I would like to suggest we do instead is...

CREATE TABLE ml_cache.tone_check_alt (
    wiki_id text,
    page_id bigint,
    model_version text,
    revision_id bigint,
    paragraph int,
    content text,
    score float,
    PRIMARY KEY ((wiki_id, page_id), model_version, revision_id, paragraph)
);

...which will make inserts look like this instead....

INSERT INTO ml_cache.tone_check_alt
    (wiki_id, page_id, model_version, revision_id, paragraph, content, score)
VALUES
    ('enwiki', 1, 'v1', 10, 0, 'rain in Spain', 0.8);

INSERT INTO ml_cache.tone_check_alt
    (wiki_id, page_id, model_version, revision_id, paragraph, content, score)
VALUES
    ('enwiki', 1, 'v1', 10, 1, 'falls on the plains', 0.8);
NOTE: One INSERT statement per paragraph

...and results will look like:

[
    {
        "wiki_id": "enwiki",
        "page_id": 1,
        "model_version": "v1",
        "revision_id": 10,
        "paragraph": 0,
        "content": "rain in Spain",
        "score": 0.8
    },
    {
        "wiki_id": "enwiki",
        "page_id": 1,
        "model_version": "v1",
        "revision_id": 10,
        "paragraph": 1,
        "content": "falls on the plains",
        "score": 0.8
    }
]
NOTE: One row object per paragraph.

The data model this supports is the same, what is different is how the database will encode and return results.

I like it! Some field naming suggestions:

  • paragraph_index int
  • paragraph_content text
  • tone_issue_probability float

I know it is more verbose (we could bike-shed a bit), but I think it is much clearer as to what the fields are, and it also matches naming in other places, e.g.:

LiftWing API:

curl https://api.wikimedia.org/service/lw/inference/v1/models/edit-check:predict -X POST -d '{ "instances": [{"lang": "en", "check_type": "tone", "original_text": "text", "modified_text": "this is a great example of work", "page_title": "test"}]}' | jq .
{
  "message": "",
  "batchId": "fce36607-fbf7-4d0f-a103-8a58e06917c9",
  "predictions": [
    {
      "check_type": "tone",
      "details": {},
      "language": "en",
      "model_name": "edit-check",
      "model_version": "v1",
      "page_title": "test",
      "prediction": true,
      "probability": 0.847,
      "status_code": 200
    }
  ]
}

prediction classification change event schema, which uses probabilities term instead of scores (schema designed in T331401)

QUESTION: What is Tone Suggestion Generator (as referenced in F66750510)? Is it architecturally compatible with the sort of HTTP service mentioned above?

Tone Suggestion Generator is a streaming application deployed to Lift Wing that processes "page_content_change" events to generate tone suggestion tasks and update the necessary systems for downstream use.

It is not expected to handle client reads, which are handled by Data Gateway. The ML team don't have a preference on the RESTBase or AQS cluster. But we would like to use Data Gateway since Growth team has been using it for other structured tasks.

Tone Suggestion Generator does the following things:

  • consumes the mediawiki.page_content_change.v1 events (triggered by changeprop)
  • filters pages, processes and parses content into paragraphs, then gets score from the tone check model
  • updates (set/clear) search index by emitting mediawiki.cirrussearch.page_weighted_tags_change.v1 events
  • updates (insert/delete) data in Cassandra

... any issues with the keyspace and table names (which by convention are exposed in the DG urls)

I'd suggest to change the keyspace name from ml_cache to something like ml_structured_task. From the ML perspective, we don't implement caching for the model. This is generated data stored specifically for structured tasks.

The data model this supports is the same, what is different is how the database will encode and return results.

I think on ML side, changing to execute one INSERT statement per paragraph is ok when updating. What about DELETE? Can we delete all paragraphs for an old revision of the page?
For client reads, does this mean the query would be the same when retrieving all paragraphs for a page's latest revision, only the return results formatted differently? @Michael, does the Growth team have a preference between these two JSON results?

QUESTION: What is Tone Suggestion Generator (as referenced in F66750510)? Is it architecturally compatible with the sort of HTTP service mentioned above?

[ ... ]
It is not expected to handle client reads, which are handled by Data Gateway. The ML team don't have a preference on the RESTBase or AQS cluster. But we would like to use Data Gateway since Growth team has been using it for other structured tasks.

Ok, so if we go to production on the RESTBase cluster we'll just have to have Data Gateway access there (either a different instance, or the current one extended to cover both clusters).

... any issues with the keyspace and table names (which by convention are exposed in the DG urls)

I'd suggest to change the keyspace name from ml_cache to something like ml_structured_task. From the ML perspective, we don't implement caching for the model. This is generated data stored specifically for structured tasks.

So... the name ml_cache came from a cluster that Machine-Learning-Team has, one set up specifically for this purpose. It was never used, and the decision to use one of the extant clusters was taken instead.

But also, it kind of is a cache, isn't it? Granted, this would be preemptive caching (or prefetching). And granted, we don't have a workflow that would allow us to treat it as ephemeral (if we lost it, client lookups would simply fail). But... we are taking the output of a computation and...stashing it away for future use, to avoid having to compute it inline. And the whole invalidation part of the pipeline is very caching-like.

I'm not quite bikeshedding a name here by the way, I have a possible infrastructure improvement I'm mulling over, one that would treat use-cases like this one more like a special purpose cache.

TL;DR you don't have to take the bait on this one, if you don't want to. :)

The data model this supports is the same, what is different is how the database will encode and return results.

I think on ML side, changing to execute one INSERT statement per paragraph is ok when updating. What about DELETE? Can we delete all paragraphs for an old revision of the page?

Oh yes, the DELETE remains the same; One delete to remove all past revisions!

For client reads, does this mean the query would be the same when retrieving all paragraphs for a page's latest revision, only the return results formatted differently? @Michael, does the Growth team have a preference between these two JSON results?

Exactly, the query doesn't change.
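
For reference, a sketch of what those statements could look like against the tone_check_alt schema above (hypothetical values):

-- Read: all paragraph rows for a page's revision under a given model version
SELECT paragraph, content, score
FROM ml_cache.tone_check_alt
WHERE wiki_id = 'enwiki'
  AND page_id = 1
  AND model_version = 'v1'
  AND revision_id = 10;

-- Cleanup: one range delete removes every paragraph row for older revisions
DELETE FROM ml_cache.tone_check_alt
WHERE wiki_id = 'enwiki'
  AND page_id = 1
  AND model_version = 'v1'
  AND revision_id < 10;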

consumes the mediawiki.page_content_change.v1 events (triggered by changeprop)

Oh right mediawiki.page_content_change.v1. I think this will require a new rule in change-prop then. IIRC, right now, LiftWing is called on mediawiki.page_change.v1 events.

suggest to change the keyspace name from ml_cache to something like ml_structured_task

Maybe just structured_task? The fact that ML is involved in the task generation seems like an implementation detail? We could use this keyspace for other structured tasks too.

But also, it kind of is a cache, isn't it?

I'm not sure if it is! At the very least, it is not a read-through cache. But as we discussed in slack, the line is blurry.

In this case, there is no way to make a request for the task content to be generated. Tasks are the data stored in this cassandra table. If the data isn't there, it can't be obtained by users. If we had done a batch task generation approach, I think this would be more apparent.

This is different than the ML article topic prediction cache (I think?). In that case, the user can make a request to LiftWing for a topic prediction, if not stored in cache, it will be computed at request time and returned to user.

I've created a draft merge-request here: https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/9, please have a look and let me know if —among other things— there are any issues with the keyspace and table names (which by convention are exposed in the DG urls), attribute names (which by convention will be returned in JSON results), or the order/disposition of URL parameters (I've ordered them differently to how they appear in the schema).

From that MR I'm having trouble deciphering what the request pattern would look like exactly. Would it just be http://localhost:1234/ml_cache/{wiki_id}/{page_id}? (With localhost:1234 being the internal URL for the DataGateway.) I would have expected that to include something about Revise Tone? Note that for image suggestions it is a bit more elaborate: /public/image_suggestions/suggestions/{wiki}/{page_id} (per the docs).

Also, let me know whether or not you think we should include all of the attributes in the results. For example, do you want wiki_id, page_id, etc, in the results given that presumably the caller will know them (having just supplied them as query parameters).

Yes, they should be included in the returned result.

The data model this supports is the same, what is different is how the database will encode and return results.

I think on ML side, changing to execute one INSERT statement per paragraph is ok when updating. What about DELETE? Can we delete all paragraphs for an old revision of the page?
For client reads, does this mean the query would be the same when retrieving all paragraphs for a page's latest revision, only the return results formatted differently? @Michael, does the Growth team have a preference between these two JSON results?

For us, both are fine.

I like it! Some field naming suggestions:

  • paragraph_index int
  • paragraph_content text
  • tone_issue_probability float

I like these!

But also, it kind of is a cache, isn't it?

I'm not sure if it is! At the very least, it is not a read-through cache. But as we discussed in slack, the line is blurry.

In this case, there is no way to make a request for the task content to be generated. Tasks are the data stored in this cassandra table. If the data isn't there, it can't be obtained by users. If we had done a batch task generation approach, I think this would be more apparent.

No, I know, I even mentioned as much in my reply to @AikoChou. You could implement that request though, in fact, it's almost like everything is there and we've simply elided it (which you can, if you assume storage is durable, and lives in perpetuity, etc). I think our motivation in persisting it though, the purpose of doing so, aligns pretty well with that of "cache".

This is different than the ML article topic prediction cache (I think?). In that case, the user can make a request to LiftWing for a topic prediction, if not stored in cache, it will be computed at request time and returned to user.

Yup, but is it otherwise different in any meaningful way? Is there any reason one would need to be done differently than the other?

P.S. I'm not trying to drive this work in a different direction at this stage; This is fodder for the decision brief, and future work!

I've created a draft merge-request here: https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/9, please have a look and let me know if —among other things— there are any issues with the keyspace and table names (which by convention are exposed in the DG urls), attribute names (which by convention will be returned in JSON results), or the order/disposition of URL parameters (I've ordered them differently to how they appear in the schema).

From that MR I'm having trouble deciphering what the request pattern would look like exactly. Would it just be http://localhost:1234/ml_cache/{wiki_id}/{page_id}? (With localhost:1234 being the internal URL for the DataGateway.) I would have expected that to include something about Revise Tone? Note that for image suggestions it is a bit more elaborate: /public/image_suggestions/suggestions/{wiki}/{page_id} (per the docs).

As it currently is in the merge-request, it would be: /public/ml_cache/tone_check/{model_version}/{wiki_id}/{page_id}/{revision_id}

Yup, but is it otherwise different in any meaningful way?

Technically, maybe not. But in terminology/usage/common understanding, maybe! But yes, agree that we should sidetrack this discussion for larger stuff; as is, this is fine!

I'm not quite bikeshedding a name here by the way, I have a possible infrastructure improvement I'm mulling over, one that would treat use-cases like this one more like a special purpose cache.

What you all said makes a lot of sense! Regarding whether it's cache or not, the line is blurry. The difference between the article topic prediction cache and this use case is like what @Ottomata described. Also, the article topic prediction cache is not created for a structured task, and we decided not to use Data Gateway for it.
So I'm fine with the name ml_cache if we consider it like a special purpose cache.

Oh right mediawiki.page_content_change.v1. I think this will require a new rule in change-prop then. IIRC, right now, LiftWing is called on mediawiki.page_change.v1 events.

Yep, we'll use mediawiki.page_content_change.v1. I think we just need to change the kafka_topic in change-prop, right?

Oh yes, the DELETE remains the same; One delete to remove all past revisions!

Ok, then I think the proposed schema works!

Yep, we'll use mediawiki.page_content_change.v1. I think we just need to change the kafka_topic in change-prop, right?

Ya I believe so! A new rule for your thang with that as kafka_topic :)

I've created a draft merge-request here: https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/9, please have a look and let me know if —among other things— there are any issues with the keyspace and table names (which by convention are exposed in the DG urls), attribute names (which by convention will be returned in JSON results), or the order/disposition of URL parameters (I've ordered them differently to how they appear in the schema).

From that MR I'm having trouble deciphering what the request pattern would look like exactly. Would it just be http://localhost:1234/ml_cache/{wiki_id}/{page_id}? (With localhost:1234 being the internal URL for the DataGateway.) I would have expected that to include something about Revise Tone? Note that for image suggestions it is a bit more elaborate: /public/image_suggestions/suggestions/{wiki}/{page_id} (per the docs).

As it currently is in the merge-request, it would be: /public/ml_cache/tone_check/{model_version}/{wiki_id}/{page_id}/{revision_id}

I'm not sure about having {model_version} in the path there. While it is important to have that version in the data that is returned, having it in the path means that every change to the model absolutely requires a change to the application code as well. Unless we support something like /latest/ for that. But if we support /latest/, then why have the {model_version} in the path in the first place?

every change to the model absolutely requires a change to the application code as well

This is probably a good thing. IIUC, model_version rarely changes, but if it does, you probably want to have a managed upgrade path. This also would give you the ability to A/B test serving different model versions. I would expect when this happens that ML could generate and store tasks using both models, until we are sure the new model_version is the one to use.

Which...points me to a different naming suggestion...

every change to the model absolutely requires a change to the application code as well

This is probably a good thing. IIUC, model_version rarely changes, but if it does, you probably want to have a managed upgrade path. This also would give you the ability to A/B test serving different model versions. I would expect when this happens that ML could generate and store tasks using both models, until we are sure the new model_version is the one to use.

Which...points me to a different naming suggestion...

I doubt A/B testing based on model data will be a common occurrence at all. And if it should happen, we can differentiate based on the model version in the response data. On a separate note, I actually expect the model version to change relatively often in the beginning as we retrain the model or adjust its data-cleaning pipeline based on user feedback.

But maybe I'm misunderstanding what this model_version is intended to represent?

...and also back to the 'is it a cache' discussion!

CREATE TABLE ml_cache.tone_check (

I was about to comment that we should name this table more specifically about structured tasks, e.g. page_task_tone_issue or something. But then I realized (belatedly, sorry!) that the data model is indeed simply storing the outputs of the tone_check model endpoint for specific pages instead of for random text input.

Even though we only intend to store predictions about a very specific kind of article (sports, etc.), there is no reason why this table couldn't be used as a cache in the same way that article topic prediction cache will be.

I don't love the cache naming because I think a (read-through) cache is a more specific kind of thing, but since what we are storing is the tone_check prediction values (about MW pages), I'm okay with thinking of this as a cache (and putting it in the ml_cache namespace).

every change to the model absolutely requires a change to the application code as well

This is probably a good thing. IIUC, model_version rarely changes, but if it does, you probably want to have a managed upgrade path. This also would give you the ability to A/B test serving different model versions. I would expect when this happens that ML could generate and store tasks using both models, until we are sure the new model_version is the one to use.

Managed upgrades was the impression I was under as well.

I doubt A/B testing based on model data will be a common occurrence at all. And if it should happen, we can differentiate based on the model version in the response data. On a separate note, I actually expect the model version to change relatively often in the beginning as we retrain the model or adjust its data-cleaning pipeline based on user feedback.

But maybe I'm misunderstanding what this model_version is intended to represent?

So when a model version changes, storage would be migrated to the new version...organically? I'm sure there must be quite a distribution of edit frequencies, is it OK to be serving tone checks from two or more model versions?

We can do this of course, but we're going to have to revisit the schema again. The last iteration discussed, for example, won't work (well, it could, but it would require storing the model version per paragraph).

Before I propose alternative(s), is this an issue we have consensus on?

I doubt A/B testing based on model data will be a common occurrence at all. And if it should happen, we can differentiate based on the model version in the response data. On a separate note, I actually expect the model version to change relatively often in the beginning as we retrain the model or adjust its data-cleaning pipeline based on user feedback.

But maybe I'm misunderstanding what this model_version is intended to represent?

So when a model version changes, storage would be migrated to the new version...organically?

I don't have the professional background you all have. Could you all elaborate more on your thinking for why you assume that a migration of storage would usually be required?
To me, for example, changing how we remove paragraphs with quotes in the data-cleaning pipeline, or further training of the model with additional samples, changes nothing about how the output of the model is stored in Cassandra and provided in Data Gateway, right?

Also, for reference, we do not have a model version (or "pipeline version") for image suggestions at all. Neither in the model nor in the data.

Or is your interpretation of "model version" closer to "data model version" and not "ML model version"? In that case it should probably be named "storage version" or "API version". (And then it would make sense to be in the path.)

I'm sure there must be quite a distribution of edit frequencies, is it OK to be serving tone checks from two or more model versions?

From my point of view, the model version would mainly be used for statistics. If there are multiple rows for the same article with different model versions, we use the one with the highest model version.

We can do this of course, but we're going to have to revisit the schema again. The last iteration discussed, for example, won't work (well, it could, but it would require storing the model version per paragraph).

Why? I have trouble following how you get to this conclusion. Please explain, and keep in mind that I have basically no clue about how Cassandra works at all.

To me, whether we have a schema that is

[
    {
        "wiki_id": "enwiki",
        "page_id": 1,
        "model_version": "v1",
        "revision_id": 12,
        "paragraphs": [["falls in plains", 1, 0.9], ["rain in spain", 0, 0.8]]
    }
]

or

[
    {
        "wiki_id": "enwiki",
        "page_id": 1,
        "model_version": "v1",
        "revision_id": 10,
        "paragraph": 0,
        "content": "rain in Spain",
        "score": 0.8
    },
    {
        "wiki_id": "enwiki",
        "page_id": 1,
        "model_version": "v1",
        "revision_id": 10,
        "paragraph": 1,
        "content": "falls on the plains",
        "score": 0.8
    }
]

does not seem to matter much? All else equal, I would prefer the format with the paragraphs being collected into one response: less work for us to puzzle these back together on our end.

Before I propose alternative(s), is this an issue we have consensus on?

I'm sorry, but I'm not clear on what you're proposing here that you're asking consensus for.

I think the key questions to clarify are:

  • What does model_version represent?

The model_version represents the Tone Check model we use for the Tone Suggestion Generator in LiftWing. Currently, only one version is available, which is trained by the Research Team and also used by the Editing Team. You can find it here. While we're building a retraining pipeline for this model, we haven't retrained it yet.
Note that any changes to data processing or cleaning in the Tone Suggestion Generator during the iterations won't change the model_version because we'll still use the same model for predictions.

  • Will the model version change relatively often?

I don't think it will change often for now (this quarter). We plan to retrain the model using (1) a newer dataset and (2) the latest version of the BERT base model, but we need to wait for our Airflow training pipeline to be ready. As for retraining based on user feedback, there are only initial ideas and discussions at this point (T393103).

...and also back to the 'is it a cache' discussion!

CREATE TABLE ml_cache.tone_check (

I was about to comment that we should name this table more specifically about structured tasks, e.g. page_task_tone_issue or something. But then I realized (belatedly, sorry!) that the data model is indeed simply storing the outputs of the tone_check model endpoint for specific pages instead of for random text input.

Even though we only intend to store predictions about a very specific kind of article (sports, etc.), there is no reason why this table couldn't be used as a cache in the same way that article topic prediction cache will be.

I don't love the cache naming because I think a (read-through) cache is a more specific kind of thing, but since what we are storing is the tone_check prediction values (about MW pages), I'm okay with thinking of this as a cache (and putting it in the ml_cache namespace).

IIUC, you're okay with not naming this table more specifically about structured tasks?

I agree that there's no reason this table couldn't be used as a cache like the article topic prediction cache, or for all MW pages in the future (if we can handle the traffic), so a generic name is not a bad idea for me.

I was actually thinking page_tone_check, which implies that we applied the model to pages rather than to random text input.

IIUC, you're okay with not naming this table more specifically about structured tasks?

Yes, seeing as what you are storing is predictions about pages, not task work lists themselves.

I was actually thinking page_tone_check, which implies that we applied the model to pages rather than to random text input.

I find 'tone_check' to be kind of a weird name for what this is. You aren't storing a 'check'.

How about page_tone_prediction or page_tone_prediction_classification (matching the naming we chose for page/prediction_classification_change event schema).

Although, if we use Eric's latest proposed model, each record is about a paragraph on a page, not just a page? page_paragraph_tone_prediction ?

Just some ideas! The decision is yours!

Trying to address the blocking ones:

  • Naming

Keyspace: ml_cache
Table name: page_tone_check

I think "page_tone_check" is okay since the ML model is called Tone Check, and the data stored is outputs from the model.

  • URI disposition

/public/ml_cache/page_tone_check/{model_version}/{wiki_id}/{page_id}/{revision_id}

  • Result set

Given the table name "page_tone_check", I prefer each record to represent a page rather than a paragraph, and I prefer the attribute name "tone_issue_paragraphs" over "paragraphs":

[
    {
        "wiki_id": "enwiki",
        "page_id": 1,
        "model_version": "v1",
        "revision_id": 12,
        "tone_issue_paragraphs": [["falls in plains", 1, 0.9], ["rain in spain", 0, 0.8]]
    }
]
  • Schema
CREATE TABLE ml_cache.page_tone_check (
    wiki_id       text,
    page_id       bigint,
    model_version text,
    revision_id   bigint,
    tone_issue_paragraphs    set<tuple<text, int, float>>,  -- paragraph text, index, score
    PRIMARY KEY((wiki_id, page_id), model_version, revision_id)
);

Questions
Is the article set deterministic? Not every article will be checked for tone; The design says that filtering will be applied, and that only a subset of articles will be considered. Is this guaranteed to always be the same subset? If not, how likely is it that we would persist tone checks that would subsequently be omitted by the filtering? Is there potential here to "leak" tone check storage and/or fail to remove weighted tags in search?

Very good question. Michael raised this before, and I've been thinking how to address it. The article set isn't deterministic in the design, so it's possible to persist tone checks that would later be omitted by filtering.

About filtering, from our previous analysis, we know we can find a higher percentage of tone issues in articles about people, sports, and those tagged with relevant templates indicating a tone issue. Article topics may change, though this wouldn't happen often. But templates can be added or removed much more frequently.

One way is to look up the table each time we get a new revision and check if that article exists before filtering. But as you mentioned in another section, this preemptive read is expensive.

Another idea is to skip filtering entirely. We'd do an initial article set ingest to Cassandra, then the Tone suggestion task generator would process all changed pages without filtering, so the article set only grows, never shrinks. This would simplify our implementation, but write throughput would be on a different scale.

@Eevans and I just had a short call to clarify some of the open questions here. My summary (also to make sure I understand things correctly):

We need to differentiate between data-model and query-model. In the data-model, the data is laid out based on the primary key where there is a 1:many distribution between the components of the primary key from the left to the right.

That means that the schema so far requires us to also specify a model_version if we want to specify the revision_id:

CREATE TABLE ml_cache.page_tone_check (
    wiki_id       text,
    page_id       bigint,
    model_version text,
    revision_id   bigint,
    tone_issue_paragraphs    set<tuple<text, int, float>>,  -- paragraph text, index, score
    PRIMARY KEY((wiki_id, page_id), model_version, revision_id)
);

However, we can resolve this by moving the model_version all the way to the right in the primary key:

CREATE TABLE ml_cache.page_tone_check (
    wiki_id       text,
    page_id       bigint,
    model_version text,
    revision_id   bigint,
    tone_issue_paragraphs    set<tuple<text, int, float>>,  -- paragraph text, index, score
    PRIMARY KEY((wiki_id, page_id), revision_id, model_version)
);

In the query model, we would always return all entries we have for the given revision:

SELECT * 
FROM table
WHERE wiki_id = 'en'
  AND page_id = 42 
  AND revision_id = 1001;

The main question is how updates of the model version would be handled. If we only ever run the pipeline with incremental updates, then we might be running model version 4 and still have some never-edited article in the database with model version 1 (that is what was meant above by "organically"). However, it might be desirable when upgrading to do a new backfill of data with the new model version. Also, but very hypothetical, maybe we want to run two pipelines to support an A/B test of two models. In these cases, we may wish to have a way to delete entries with old model versions at some point.
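
To illustrate that last point, a sketch of what deleting an old model version's entries would involve under PRIMARY KEY ((wiki_id, page_id), revision_id, model_version), with hypothetical values: because model_version is the last clustering column, each such delete also has to name the revision_id it applies to.

DELETE FROM ml_cache.page_tone_check
WHERE wiki_id = 'enwiki'
  AND page_id = 42
  AND revision_id = 1001
  AND model_version = 'v1';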

Trying to address the blocking ones:

  • Naming

Keyspace: ml_cache
Table name: page_tone_check

I think "page_tone_check" is okay since the ML model is called Tone Check, and the data stored is outputs from the model.

I don't feel very strongly here, but this might be an opportunity to discuss nomenclature for this project more generally (now that I think of it, we probably should have started there). We have colloquially been referring to what we're storing as "tone checks". In this context, a "tone check" is something that exists on a per-paragraph basis (of course, the paragraphs are associated with a page, which in turn is associated with a wiki, etc). Presumably we're calling this thing —the structure we're storing— a "tone check" because that is the name of the ML model, which may or may not be a problem nomenclature-wise (using that name to mean two different things, depending on the context, I mean). However, it would be customary to name the table the plural of the thing being stored there, which in this case would be tone_checks? Think: a table that stores User objects called users, or one storing Post objects called posts.

So, is tone check the right name for the unit(s) being stored? Is page tone check the right terminology? And I guess, if so, are we namespacing it by page because we might use a different unit later? Are we only ever going to store them if the score is above some threshold? If so, maybe the right term is tone issue? If not, maybe tone scores?

  • URI disposition

/public/ml_cache/page_tone_check/{model_version}/{wiki_id}/{page_id}/{revision_id}

  • Result set

Given the table name "page_tone_check", I prefer each record to represent a page rather than a paragraph, and I prefer the attribute name "tone_issue_paragraphs" over "paragraphs":

From a data model perspective we are storing objects, each of which represents a "tone check" (again, if that's the right name) for a specific paragraph. In other words, these objects/tone checks are inherently paragraph-based. Which is why I introduced the more normalized structure: the query is always for one page, and the objects returned are always the paragraphs with issues, so it makes sense that the rows (the results) be one paragraph each.

That said: @Michael has raised the issue of whether we should be indexing by model_version or not, and if we do not, that will necessitate going back to a collection of paragraph structures. I think they will be meeting with you shortly to discuss this, so let's see how that pans out, and pick this up after.

[ ... ]

Questions
Is the article set deterministic? Not every article will be checked for tone; The design says that filtering will be applied, and that only a subset of articles will be considered. Is this guaranteed to always be the same subset? If not, how likely is it that we would persist tone checks that would subsequently be omitted by the filtering? Is there potential here to "leak" tone check storage and/or fail to remove weighted tags in search?

Very good question. Michael raised this before, and I've been thinking how to address it. The article set isn't deterministic in the design, so it's possible to persist tone checks that would later be omitted by filtering.

About filtering, from our previous analysis, we know we can find a higher percentage of tone issues in articles about people, sports, and those tagged with relevant templates indicating a tone issue. Article topics may change, though this wouldn't happen often. But templates can be added or removed much more frequently.

One way is to look up the table each time we get a new revision and check if that article exists before filtering. But as you mentioned in another section, this preemptive read is expensive.

Another idea is to skip filtering entirely. We'd do an initial article set ingest to Cassandra, then the Tone suggestion task generator would process all changed pages without filtering, so the article set only grows, never shrinks. This would simplify our implementation, but write throughput would be on a different scale.

Maybe we could reintroduce the idea of TTLs? If we're worried about "leaked" storage like this, could we TTL every record? That might mean that a perfectly legitimate tone issue could simply disappear from storage if no edits occur within the TTL period, but maybe that's OK? That doesn't do anything to remove the weighted search tag though (which seems like a similar but separate issue to the Cassandra one).

Eevans updated the task description.

@AikoChou, @BWojtowicz-WMF and I have met and discussed this more. Below is my summary:

About indexing and querying the model:

  • A primary key of PRIMARY KEY((wiki_id, page_id), revision_id, model_version) makes sense to us, as does still indexing the model_version. The querying would happen without specifying the model version.

About updating the model:

  • We agreed that the model_version in the data must follow a consistent and well-defined pattern, so that it can be used both in GrowthExperiments to pick the newest version and in potential deletions based on it

About preventing "leaked" storage:

  • TTLs are not the preferred way to go here, because they could result in valid results getting deleted that still have an associated weighted tag in CirrusSearch. Our idea was to ask whether we could expand the "indiscriminate deletion" policy to be even more indiscriminate and delete on every change to pages in the main namespace, regardless of whether they pass the filter. That would substantially increase the number of deletes, but should make sure we catch most relevant pages.

It’s good that we’re discussing this! I've learned a lot :)

  • Naming

Based on your feedback, I propose naming the table paragraph_tone_scores. Why:

if so, are we namespacing it by page because we might use a different unit later?

We namespace it by page because that's the "source" unit we process data from, not because we might use a different unit later. Since we won't use a different source unit, I think including "page" is unnecessary.

From a data model perspective we are storing objects, each of which represents a "tone check" (again, if that's the right name) for a specific paragraph. In other words, these objects/tone checks are inherently paragraph-based. Which is why I introduced the more normalized structure, the query is *always* for one page, the objects returned are always the paragraphs with issues, so it makes sense that the rows (the results) be one paragraph each.

This makes sense to me.

We have outputs on a per-paragraph basis, but this could be different if we want. It depends entirely on our data processing/parsing steps and how the model performs. We can feed a page, section, or sentence into the Tone Check model to get tone scores. We chose paragraphs because we think they strike the right balance of giving the model enough information: sentences may lack context, while pages or sections would exceed the current model's max context window. So I think it's good to include "paragraph" in the table name. In the future, we could create different tables for different units, e.g. sentence or section.

Are we only ever going to store them if the score is above some threshold? If so, maybe the right term is tone issue? If not, maybe tone scores?

I think "tone_scores" would be better, because we may also want to store data which doesn't have tone issue. Not for this growth experiment, but more generally, we can use this data for model evaluation with a balanced dataset that includes both positive and negative samples.

  • Schema
CREATE TABLE ml_cache.paragraph_tone_scores (
    wiki_id       text,
    page_id       bigint,
    model_version text,
    revision_id   bigint,
    tone_score    set<tuple<text, int, float>>,  -- paragraph text, index, score
    PRIMARY KEY((wiki_id, page_id), revision_id, model_version)
);

or

CREATE TABLE ml_cache.paragraph_tone_scores (
    wiki_id text,
    page_id bigint,
    model_version text,
    revision_id bigint,
    paragraph_index int,
    paragraph_content text,
    tone_score float,
    PRIMARY KEY ((wiki_id, page_id), revision_id, paragraph_index, model_version)
);

Is it possible to do the above? As Michael said, Growth doesn't want to specify the model version when querying, and they can handle selecting the version on their end.
What are the benefits of using a normalized structure?

We won't use a different source unit, so I think including page is unnecessary.

Could we go with page_paragraph_tone_scores? This is specifically representing paragraphs belonging to MediaWiki pages, and the primary key very explicitly includes page_id.

Ok, I've updated https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/9

The schema in the merge-request now looks like:

CREATE TABLE ml_cache.paragraph_tone_scores (
    wiki_id        text,
    page_id        bigint,
    revision_id    bigint,
    model_version  text,
    content        text,
    score          float,
    idx            int,
    PRIMARY KEY ((wiki_id, page_id), revision_id, model_version, idx)
);

Ok so, names: Based on @AikoChou's comment above (thanks for that!) we can now refer to the things we are storing as "paragraph tone scores", that is the unit of storage. Paragraph tone scores are inherently (and more importantly, explicitly) paragraph based, so I'm recommending idx, content, and score because they are the paragraph tone score index, content, and score value. It makes more sense to qualify the other attributes (wiki_id, page_id, revision_id, and model_version), but it seems superfluous for these. Also, the Data Gateway is more of an HTTP bridge to the database, rather than a proper API, so I think it makes sense to lean a bit more toward conventions for database naming than the sort we might favor for a public interface (we are the audience here).

By the way, I used idx there because index is a CQL keyword, and using reserved keywords in schema tends to be painful. 😣

Finally, I used (again) the version of the schema that doesn't store the scores in a collection, because that is what most closely matches your data model. The collection (i.e. the set<tuple<text, int, float>>) was always a hack/workaround we were using to simplify an overwrite of all scores at once (before we were using revision).

That said: I'm mostly employing Cunningham's Law here, so if others feel strongly about any of this then please speak up!

On to the rest....

The URI in the merge-request now looks like...

/public/ml_cache/paragraph_tone_checks/{wiki_id}/{page_id}/{revision_id}

Inserts would look like...

INSERT INTO ml_cache.tone_checks (wiki_id, page_id, revision_id, model_version, content, probability, idx)
    VALUES ('enwiki', 1, 10, 'v1', 'rain in spain', 0.2, 0);

INSERT INTO ml_cache.tone_checks (wiki_id, page_id, revision_id, model_version, content, probability, idx)
    VALUES ('enwiki', 1, 10, 'v1', 'falls mostly on the plains', 0.3, 1);

INSERT INTO ml_cache.tone_checks (wiki_id, page_id, revision_id, model_version, content, probability, idx)
    VALUES ('enwiki', 1, 10, 'v2', 'rain in spain', 0.5, 0);

INSERT INTO ml_cache.tone_checks (wiki_id, page_id, revision_id, model_version, content, probability, idx)
    VALUES ('enwiki', 1, 10, 'v2', 'falls mostly on the plains', 0.6, 1);

...and results would look like...

{
   "rows" : [
      {
         "content" : "rain in spain",
         "idx" : 0,
         "model_version" : "v1",
         "page_id" : 1,
         "score" : 0.2,
         "revision_id" : 10,
         "wiki_id" : "enwiki"
      },
      {
         "content" : "falls mostly on the plains",
         "idx" : 1,
         "model_version" : "v1",
         "page_id" : 1,
         "score" : 0.3,
         "revision_id" : 10,
         "wiki_id" : "enwiki"
      },
      {
         "content" : "rain in spain",
         "idx" : 0,
         "model_version" : "v2",
         "page_id" : 1,
         "score" : 0.5,
         "revision_id" : 10,
         "wiki_id" : "enwiki"
      },
      {
         "content" : "falls mostly on the plains",
         "idx" : 1,
         "model_version" : "v2",
         "page_id" : 1,
         "score" : 0.6,
         "revision_id" : 10,
         "wiki_id" : "enwiki"
      }
   ]
}

We won't use a different source unit, so I think including page is unnecessary.

Could we go with page_paragraph_tone_scores? This is specifically representing paragraphs belonging to MediaWiki pages, and the primary key very explicitly includes page_id.

I understand the point here, but I think it's clear that the data represents paragraphs from MediaWiki pages when we have page_id as part of primary key, and people have to provide page_id when querying. Isn't it? So I lean a bit more toward not including "page" in the table name.

Ok, I've updated https://gitlab.wikimedia.org/repos/sre/data-gateway/-/merge_requests/9
The schema in the merge-request now looks like:
[...]

The MR looks good to me!

In the examples, you still wrote "tone_checks", but I understand what you mean.

URI:

/public/ml_cache/paragraph_tone_scores/{wiki_id}/{page_id}/{revision_id}

Insert:

INSERT INTO ml_cache.paragraph_tone_scores (wiki_id, page_id, revision_id, model_version, content, score, idx)
    VALUES ('enwiki', 1, 10, 'v1', 'rain in spain', 0.2, 0);

@Eevans I also wanted to follow up on the next step for this task.

In yesterday's meeting with @Michael, we agreed to do a small ingestion for testwiki using mock articles. This will allow Growth to build and test their implementation as soon as possible. We'd like to proceed with this once Cassandra and Data Gateway are ready.

My question is: what work is required from us to load a few mock records on staging Cassandra? Is this something that can be done on your end?

[ ... ]

My question is: what work is required from us to load a few mock records on staging Cassandra? Is this something that can be done on your end?

I think so, yes. If you have specific mock data in mind, a csv-formatted file should work.
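
For reference, a CSV like that could likely be loaded with cqlsh's COPY command; a sketch only, with a hypothetical file name and the table as it stands in the merge-request:

COPY ml_cache.paragraph_tone_scores (wiki_id, page_id, revision_id, model_version, content, score, idx)
FROM 'mock_tone_scores.csv' WITH HEADER = TRUE;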

I think so, yes. If you have specific mock data in mind, a csv-formatted file should work.

Good to know! :) To enable Growth to test their implementation on testwiki, we'll manually load small mock records to staging Cassandra.

  1. Create some mock articles on testwiki
  2. Provide a CSV-formatted file that follows this format for Eric:
wiki_id, page_id, revision_id, model_version, content, score, idx
'testwiki', 1, 10, 'v1', 'rain in spain', 0.2, 0

@Michael, could you create the mock articles and prepare this CSV file?
@dcausse, regarding the weighted search tag, IIUC, for a few mock records, the simplest way to produce events to CirrusSearch would be using kafkacat like P84613, right?

@dcausse, regarding the weighted search tag, IIUC, for a few mock records, the simplest way to produce events to CirrusSearch would be using kafkacat like P84613, right?

Yes, the event you crafted is perfect; it's just missing the meta section. Also, please use event-gate to push them (otherwise you would bypass the various event platform validations):

curl -H"User-Agent: achou-T401021/wmf" -H"Content-Type: application/json" -XPOST https://eventgate-main.discovery.wmnet:4492/v1/events -d '[events]'

Event:

{
  "meta": {
    "stream": "mediawiki.cirrussearch.page_weighted_tags_change.v1",
    "domain": "test.wikipedia.org"
  },
  "dt": "2025-11-03T10:56:00Z",
  "wiki_id": "testwiki",
  "page": {
    "page_id": 1,
    "page_title": "SomePage",
    "namespace_id": 0,
    "is_redirect": false
  },
  "weighted_tags": {
    "set": {
      "recommendation.tone": [
        {
          "tag": "exists",
          "score": 1.0
        }
      ]
    }
  }
}

Could we go with page_paragraph_tone_scores?

I think it's clear that the data represents paragraphs from MediaWiki pages when we have page_id as part of primary key

Perhaps, but I'm suggesting this in anticipation of more MediaWiki page entity related derived data. In the future I think we will want to standardize on a table name and key for derived data keyed by page_id (ML predictions, structured tasks, etc.). If/when we do, we'd likely want to include the MediaWiki entity name in all derived data tables like this, to make it clear which ones are about pages (and what their expected key is), which ones are about users, etc. etc.

Could we go with page_paragraph_tone_scores?

I think it's clear that the data represents paragraphs from MediaWiki pages when we have page_id as part of primary key

Perhaps, but I'm suggesting this in anticipation of more MediaWiki page entity related derived data. In the future I think we will want to standardize on a table name and key for derived data keyed by page_id (ML predictions, structured tasks, etc.). If/when we do, we'd likely want to include the MediaWiki entity name in all derived data tables like this, to make it clear which ones are about pages (and what their expected key is), which ones are about users, etc. etc.

When the request is: /public/ml_cache/paragraph_tone_checks/{wiki}/{page}/{revision}, I think it makes it obvious that the paragraph tone checks are for the revision of a page of a wiki. If that is not the case then I think we have to also consider page_revision_paragraph_tone_scores, or even wiki_page_revision_paragraph_tone_scores.

 If that is not the case then I think we have to also consider page_revision_paragraph_tone_scores, or even wiki_page_revision_paragraph_tone_scores.

Ya, where to draw the contextual naming line is a judgement call. Most of what we do is MediaWiki, so 'wiki' is already implicit in almost everything we do. Sometimes things we do are WMF-MediaWiki specific.

In this case, we may create other tables about MediaWiki pages with a similar (key) data model and update pipeline. If I were a newly hired engineer or product manager browsing API docs or a data catalog for existing data about MediaWiki pages that I might be able to use for my new project, I would expect those things to be named consistently. I'd expect to find page-related APIs and datasets that start with 'page', or at least have 'page' in the name somewhere. E.g. in MW Core, 'pagelinks' is not 'links' (...I'm sure we could find counter-example arguments in MW too! :)

If we do not prefix with page here, then we'll have to name-bikeshed each new table / API endpoint to decide if there is enough naming context to easily understand it is about pages.

When the request is: /public/ml_cache/paragraph_tone_checks/{wiki}/{page}/{revision}, I think it makes it obvious [...]

When the request is /public/ml_cache/paragraph_tone_checks/enwiki/12345/799290, I'm not sure it is obvious ;)


Anyway, I've made my case! I leave it to yall to make the final call. Thank you for reading!

Fyi, I've rearranged some of what I'm quoting here (I hope that's OK).

Anyway, I've made my case! I leave it to yall to make the final call. Thank you for reading!

I think plodding through all of this (without blocking the task's progress, of course) is worthwhile. It's less about the name of this table than it is about having a rationale we can carry forward for all of the subsequent tasks. So thank you for taking the time to share!

When the request is: /public/ml_cache/paragraph_tone_checks/{wiki}/{page}/{revision}, I think it makes it obvious [...]

When the request is /public/ml_cache/paragraph_tone_checks/enwiki/12345/799290, I'm not sure it is obvious ;)

Maybe less obvious if you're parsing request logs, yes. :) Is that in scope?

[ ... ]

In this case, we may create other tables about MediaWiki pages with a similar (key) data model and update pipeline. If I were a newly hired engineer or product manager browsing API docs or a data catalog for existing data about MediaWiki pages that I might be able to use for my new project, I would expect those things to be named consistently. I'd expect to find page-related APIs and datasets that start with 'page', or at least have 'page' in the name somewhere. E.g. in MW Core, 'pagelinks' is not 'links' (...I'm sure we could find counter-example arguments in MW too! :)

I've talked about the Data Gateway serving as a catalog of datasets, but that's not something that's ever been discussed in detail (we should do that soon!). Mostly, though, I based that on the notion that since there is no authentication, and any client inside our network can access it, reuse was possible. I think to make that useful though, you'd need to go further. I've thought about putting some bits in place so that you could commit openapi fragments in order to properly document the endpoints. I've also (more than once) questioned whether it made sense to stick to the convention of deriving endpoints from keyspace & table names. Relaxing that requirement would let you add URI elements to affect namespacing too.

In short: Trying to make the Data Gateway both a database abstraction and a well-documented API might be asking too much, especially if the only tool at our disposal is keyspace and table names (which make for pretty blunt instruments).

If we do not prefix with page here, then we'll have to name-bikeshed each new table / API endpoint to decide if there is enough naming context to easily understand it is about pages.

But it's not about pages. It's about paragraphs, which are implicitly part of pages (revisions of pages), which are implicitly part of a wiki. That's kind of where I'm getting tripped up here: I get your rationale (and don't disagree with it), and it seems to be the same rationale that resulted in paragraph_ being prefixed to the table name. If anything, I imagine a future where a tone check (or similar) was page-based, at which point the name would be page_tone_checks.

But it's not about pages. It's about paragraphs, which are implicitly part of pages (revisions of pages),

True, but specifically about paragraphs that belong to MediaWiki pages. Paragraphs do not have a corresponding MediaWiki entity concept. Paragraphs do not have a unique id with which they can be referred to alone. They require a page_id (and/or revision_id) to be contextualized.

While some humans reading this thread may implicitly know that paragraph here specifically refers to a MediaWiki page paragraph, I don't think it is obvious enough for others.

More generally, while this table and other page entity derived tables in Cassandra are unlikely to be interacted with directly, any intermediate derived tables (e.g. a batch-computed Data Lake table) will be. We certainly don't need to align a (hypothetical) source Data Lake table name (convention) with a serving layer table name, but I think we should.

Not including "page" somewhere in the name of the table means it will be difficult for other re-users of this data to discover. Imagine being a product manager hired 5 years from now, and none of us who work on tone check tasks are at WMF anymore (say it ain't so!). The PM opens datahub (which could also include Cassandra tables) and tries to find all of the datasets that store prediction scores about pages. I don't think they will find paragraph_tone_checks.


Clearly I care about naming! But once again, I only put my arguments here and leave it to you to make the final call!

I think so, yes. If you have specific mock data in mind, a csv-formatted file should work.

Good to know! :) To enable Growth to test their implementation on testwiki, we'll manually load small mock records to staging Cassandra.

  1. Create some mock articles on testwiki
  2. Provide a CSV-formatted file that follows this format for Eric:
wiki_id, page_id, revision_id, model_version, content, score, idx
'testwiki', 1, 10, 'v1', 'rain in spain', 0.2, 0

@Michael, could you create the mock articles and prepare this CSV file?

There you go:

This is the data for 4 mock articles that I've now created on testwiki. I've used -1 for the idx column to signify that this value is missing (as it will be, based on my understanding, in the current version of the pipeline, because it is based on wikitext).
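
For reference, one way a CSV like that could be loaded is cqlsh's COPY ... FROM. This is only a sketch: it uses the table name that was eventually chosen, the file name is hypothetical, and how the mock data was actually loaded isn't recorded here.

# sketch: bulk-load the mock CSV into the staging table (file name is hypothetical)
cqlsh -e "COPY ml_cache.page_paragraph_tone_scores (wiki_id, page_id, revision_id, model_version, content, score, idx) FROM 'mock_tone_scores.csv' WITH HEADER = TRUE;"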

Clearly I care about naming! But once again, I only put my arguments here and leave it to you to make the final call!

Ok, while I continue to think this is a bit overkill ("word salad"), and that it's generally a mistake to rely entirely on table names to document, if this is a convention that we can stick to then I think that's fine. Two things though:

First, can someone document this in the decision brief section of the ticket? The idea would be to explain this naming policy/convention in a way that can be applied to subsequent projects. I would do that, but I'm not sure I understand it well enough.

Second (and I will document this one in the decision brief), I think going forward we need to decouple the notion of DG uris that are derived from the keyspace & table names (or at least soften that expectation). In other words, accept that the keyspace and table can be potato and carrot respectively, and the DG uri could be /public/apple/banana/lemon/{arg0}/{arg1}/. One reason for this is that Cassandra has a 48 character limit on table names, and page_paragraph_tone_checks is more than halfway there. It's not hard to imagine a name where the descriptive part combined with the namespace prefixing pushes past 48 chars (and I'd hate to see us torturing the descriptive part to make it fit). Other databases that we might one day put behind the Gateway could have limitations of their own. Finally, this might add the flexibility to be more descriptive, for example: /public/ml/tone_checks/page/{wiki}/{page}/{revision}, /public/ml/tone_checks/page/paragraphs/{wiki}/{page}/{revision}, etc. (I'm not proposing those specific URIs here, they're just meant to be demonstrative).

Given our tight timeline, we'd like to have Cassandra and the Data Gateway ready this week, so we can begin integrating with Lift Wing soon. I need to make the final call to move things forward. I've read through both of your points; they're all valid.

I think the main point of divergence is whether "paragraph" is sufficiently clear as referring to paragraphs within MediaWiki pages. We follow the pattern <entity>_<signal>. One view is that the entity is simply "paragraph". The other is that the entity should be "page_paragraph", which aligns with our data ecosystem (MediaWiki entities; paragraph is not a kind of MediaWiki entity) and thus has better discoverability for future users.

After considering both perspectives, I'd like to go with page_paragraph_tone_scores. While I admit it looks a bit redundant, it may be beneficial for the future.

First, can someone document this in the decision brief section of the ticket? The idea would be to explain this naming policy/convention in a way that can be applied to subsequent projects. I would do that, but I'm not sure I understand it well enough.

I'll document this in the decision brief section later. @Ottomata, can you review it and add anything you think is missing?

Second (and I will document this one in the decision brief), I think going forward we need to decouple the notion of DG uris that are derived from the keyspace & table names (or at least soften that expectation).
[...]

I think this makes a lot of sense and provides more flexibility!

Lastly, thank you both for the thoughtful input. This has been a really valuable discussion for future reference. :)

Change #1201768 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/puppet@production] cassandra: add new grants to data-gateway role

https://gerrit.wikimedia.org/r/1201768

Change #1201768 merged by Eevans:

[operations/puppet@production] cassandra: add new grants to data-gateway role

https://gerrit.wikimedia.org/r/1201768

achou updated the task description.

Change #1201778 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] data-gateway (staging): deploy version v1.0.13

https://gerrit.wikimedia.org/r/1201778

Change #1201778 merged by jenkins-bot:

[operations/deployment-charts@master] data-gateway (staging): deploy version v1.0.13

https://gerrit.wikimedia.org/r/1201778

The table has been created, the mock data loaded, and v1.0.13 of the Gateway (w/ the new endpoint) has been deployed.

eevans@deploy2002:/srv/deployment-charts/helmfile.d/services/data-gateway$ curl https://data-gateway.k8s-staging.discovery.wmnet:30443/public/ml_cache/page_paragraph_tone_scores/testwiki/168753/680406 2>/dev/null |json_pp
{
   "rows" : [
      {
         "content" : "Schwarzwald Castle offers a wonderful glimpse into medieval life with well-preserved ruins and informative guided tours available in multiple languages. The scenic hilltop location provides excellent views of the surrounding Black Forest, making it a popular choice for families and history enthusiasts. The on-site museum showcases an impressive collection of medieval artifacts, and the castle grounds are perfect for a relaxing afternoon visit. With convenient parking and a charming café serving local specialties, Schwarzwald Castle makes for an enjoyable day trip from Freiburg.",
         "idx" : -1,
         "model_version" : "v1",
         "page_id" : 168753,
         "revision_id" : 680406,
         "score" : 0.886,
         "wiki_id" : "testwiki"
      }
   ]
}
eevans@deploy2002:/srv/deployment-charts/helmfile.d/services/data-gateway$

@dcausse Thanks a lot! I found it was also missing $schema. (eventgate complained about it)

$ cat mock_event.json 
{"$schema":"/mediawiki/cirrussearch/page_weighted_tags_change/1.0.0","meta":{"stream":"mediawiki.cirrussearch.page_weighted_tags_change.v1","domain":"test.wikipedia.org"},"dt":"2025-11-05T09:00:00Z","wiki_id":"testwiki","rev_based":true,"page":{"page_id":168753,"page_title":"Schwarzwald_Castle","namespace_id":0,"is_redirect":false},"weighted_tags":{"set":{"recommendation.tone":[{"tag":"exists","score":1.0}]}}}

$ curl -H"User-Agent: achou-T401021/wmf" -H"Content-Type: application/json" -XPOST https://eventgate-main.discovery.wmnet:4492/v1/events -d @mock_event.json

With the above, I didn't get any eventgate validation errors.

@Michael I created 4 mock events based on your csv file and have pushed them to the Search weighted tags. Let me know if you can see them.

Yep, we'll use mediawiki.page_content_change.v1. I think we just need to change the kafka_topic in change-prop, right?

I just realized a small but important detail about this: mediawiki.page_content_change.v1 only exists in Kafka jumbo-eqiad. It is not multi-DC.

We do not have a change-prop that consumes from Kafka jumbo-eqiad. The one we have consumes from Kafka main (either main-eqiad or main-codfw depending on which datacenter change-prop is in)

So, if we want to use change-prop to consume from mediawiki.page_content_change.v1 and hit LiftWing API, we need to do one of:

Option A. Produce mediawiki.page_content_change.v1 to Kafka main

This is my preferred option. I think having access to mediawiki.page_content_change.v1 and other streams like this will be useful for realtime updates for derived data products like this one.

The original reason this was not produced to Kafka main was that SRE was worried about polluting Kafka main with a stream that has large event bodies. Previously, the only user of this stream was mediawiki_content_change_v1 in the Data Lake, so there was no reason to produce it to Kafka main.

We should consider this and talk to SRE ServiceOps to see what they think.

Option B. New change-prop service consuming from Kafka jumbo

Ideally this wouldn't be too hard to do (although I'm not sure its helm chart is in good shape to make this easy). We'd have to figure out where to run it (dse-k8s-eqiad?).

This is my least preferred option. I don't want to deploy more change-props.

Option C. New change-prop rule consuming from Kafka jumbo

This would probably require:

  • A new change-prop route rule (/{api:sys}/queue-jumbo?) declared in the helm chart, like this.
  • Helm chart and helmfile modifications to support consuming from multiple kafka clusters.

If this isn't too hard, this option would be an okay compromise, assuming SRE ServiceOps won't like Option A.

I'm not sure, but I don't think this will require any actual change-prop code changes. Just helm config changes to declare the new routes and kafka configs.

Option D. Consume mediawiki.page_change.v1 instead

This is probably the fastest path to production. LiftWing already responds to mediawiki.page_change.v1 events via change-prop and Kafka main. Doing this for tone check scores would mean that the page content would have to be looked up from the MediaWiki API at score time, rather than just getting it out of the page_content_change event body.

This is already done for other models in LiftWing, so perhaps this is easy to do quickly?

I'd prefer to avoid the extra MW API lookups for page content for all of the other LiftWing usages too. All of the other options would allow us to do that.
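
To make the trade-off in Option D concrete, the per-event page content lookup would be something along these lines (a sketch using the standard Action API revisions query; the exact parameters Lift Wing uses aren't specified here):

# sketch: fetch the wikitext of a specific revision from the MediaWiki Action API
# (revision 680406 is the testwiki mock revision used elsewhere in this task)
curl -s "https://test.wikipedia.org/w/api.php?action=query&prop=revisions&revids=680406&rvprop=content&rvslots=main&format=json"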

@dcausse Thanks a lot! I found it was also missing $schema. (eventgate complained about it)

$ cat mock_event.json 
{"$schema":"/mediawiki/cirrussearch/page_weighted_tags_change/1.0.0","meta":{"stream":"mediawiki.cirrussearch.page_weighted_tags_change.v1","domain":"test.wikipedia.org"},"dt":"2025-11-05T09:00:00Z","wiki_id":"testwiki","rev_based":true,"page":{"page_id":168753,"page_title":"Schwarzwald_Castle","namespace_id":0,"is_redirect":false},"weighted_tags":{"set":{"recommendation.tone":[{"tag":"exists","score":1.0}]}}}

$ curl -H"User-Agent: achou-T401021/wmf" -H"Content-Type: application/json" -XPOST https://eventgate-main.discovery.wmnet:4492/v1/events -d @mock_event.json

With the above, I didn't get any eventgate validation errors.

@Michael I created 4 mock events based on your csv file and have pushed them to the Search weighted tags. Let me know if you can see them.

Yes, we can find them there with https://test.wikipedia.org/w/index.php?search=hasrecommendation%3Atone, thank you! We also reproduced Eric's query on our end.

This should allow us to move forward with building the integration needed for this 🙏

Eevans updated the task description.

Change #1204687 had a related patch set uploaded (by Eevans; author: Eevans):

[operations/deployment-charts@master] data-gateway: deploy v1.0.13 to production

https://gerrit.wikimedia.org/r/1204687

Change #1204687 merged by jenkins-bot:

[operations/deployment-charts@master] data-gateway: deploy v1.0.13 to production

https://gerrit.wikimedia.org/r/1204687

@achou I'm catching up! I'm curious to learn (or remember) what the status is with the change-prop issue? I assume we just did Option D?

@Ottomata Yes, we proceeded with Option D (more info in T409469). Btw, we moved from wikitext to HTML for the Revise Tone task generator in Lift Wing - now we fetch HTML content from the REST API instead of wikitext from the MediaWiki API (T412210).
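
For illustration, fetching page HTML from the MediaWiki core REST API looks roughly like this (a sketch; which REST endpoint the task generator actually calls isn't specified here, and the page title is just the mock article from earlier in this task):

# sketch: fetch rendered HTML for the latest revision of a page via the core REST API
curl -s "https://test.wikipedia.org/w/rest.php/v1/page/Schwarzwald_Castle/html"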

Thank you!

we fetch HTML content from the REST API

Oh very interesting! FYI T360794: Event stream with latest revision HTML & parent revision HTML diff is in progress in case we ever have the bandwidth or desire to switch to that. (We'd have to solve the same change-prop issues though.)

Resolved this task. Really appreciate all the input and collaboration from everyone. :)

Oh very interesting! FYI T360794: Implement stream of HTML content on mw.page_change event is in progress in case we ever have the bandwidth or desire to switch to that. (We'd have to solve the same change-prop issues though.)

Yeah, we talked about it during yesterday's ML<>Research<>Data Platform meeting. Great to see it's moving forward!