
Design event schema for ML scores/recommendations on current page state
Closed, Resolved · Public

Description

In T328899 and in T326179, Research and ML teams will be creating new event streams that 'score' wiki pages. In T328576, ML team is deprecating the mediawiki.revision-score stream in favor of model specific streams, e.g. mediawiki.revision_score_drafttopic.

In T308017: Design Schema for page state and page state with content (enriched) streams, we have remodeled the way we represent changes to mediawiki pages in events. We should use this new event schema model for representing page scores in events too. In the case of T328576 (e.g. mediawiki.revision_score_drafttopic), we don't want to remodel the old deprecated revision based streams, but we may want to use the same 'score' data model field for both the new streams, as well as these 'new' old ones.

We currently have a mediawiki/revision/score schema. It adds a scores map field like this:

scores:
      example_model:
        model_name: example_model
        model_version: 1.0.1
        prediction:
        - yes
        - mostly
        probability:
          yes: 0.99
          mostly: 0.9
          hardly: 0.01

This was done as a map field so multiple ML model scores for the revision could be included in the same event.
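For consumers, the map shape means iterating over model keys. A minimal Python sketch of how a downstream reader might handle such an event (field and model names taken from the example above; the event literal is illustrative, not a real payload):

```python
# Sketch: reading a map-style `scores` field, where each key is a model name.
# The shape follows the mediawiki/revision/score example above.
event = {
    "scores": {
        "example_model": {
            "model_name": "example_model",
            "model_version": "1.0.1",
            "prediction": ["yes", "mostly"],
            "probability": {"yes": 0.99, "mostly": 0.9, "hardly": 0.01},
        }
    }
}

# Multiple models can appear in one event, so consumers must iterate the map.
for model_name, score in event["scores"].items():
    print(model_name, score["model_version"], score["prediction"])
```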

Do we like this scores field? We have the opportunity to do whatever we want here, so let's take some time to brainstorm and bikeshed on what would be best, so we can use it in all the various ML use cases coming up.

Q: Would it be possible to use the same event field data model for things like image-suggestions?

Event Timeline

It seems we can remove the map field since we are now building a stream per model. Additionally, you mentioned that querying data in a map field is difficult. The updated schema would be:

score:
    model_name: example_model
    model_version: 1.0.1
    prediction:
    - yes
    - mostly
    probability:
     yes: 0.99
     mostly: 0.9
     hardly: 0.01

The schema fits the outlink topic and the revert-risk model, which are two new models deployed on Lift Wing.

An example for the outlink topic model would be like:

score:
    model_name: outlink-topic-model
    model_version: 1.0.1
    prediction: # topic probability > 0.5
    - biography
    - women
    probability: # list all topics and probabilities
     biography: 0.99
     women: 0.9
     geography: 0.01
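The `prediction` list in this example is derivable from the `probability` map with the 0.5 threshold noted in the comment. A small Python sketch (the threshold value is assumed from the comment above; the real per-model threshold may differ):

```python
# Derive the prediction list from the full probability map, keeping only
# topics whose probability exceeds the threshold mentioned above.
probability = {"biography": 0.99, "women": 0.9, "geography": 0.01}

THRESHOLD = 0.5  # assumed from the comment; actual model config may differ
prediction = sorted(
    (topic for topic, p in probability.items() if p > THRESHOLD),
    key=lambda t: -probability[t],  # highest-probability topics first
)
print(prediction)  # ['biography', 'women']
```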

And for the revert-risk model:

score:
    model_name: revertrisk
    model_version: 1.0.1
    prediction:
    - true
    probability:
     true: 0.9
     false: 0.1

Q: Should the score schema be added to fragment/ so that it can be referenced for different streams?

Nice!

Quick naming bikeshed: is score the best name for this field? Is that a generally used term for ML predictions? Would prediction be better? (Assuming prediction is better for the rest of this comment.)

Q: Should the score schema be added to fragment/ so that it can be referenced for different streams?

Yes, I think so. But not necessarily for different streams. We should probably make a mediawiki/page/prediction_change schema that can and will be used by many streams (e.g. mediawiki.page_outlink_prediction_change, mediawiki.page_revert_risk_prediction_change).

But yeah, this prediction field will be useful for scoring other entities, e.g. revisions in T328576 (and maybe users? other things?), so a field fragment schema would be useful.

But what namespace? Is this prediction field MediaWiki specific? It doesn't seem to be page specific. If always MediaWiki specific, we could do: fragment/mediawiki/state/entity/prediction ?

Do we like this scores field? We have the opportunity to do whatever we want here, so let's take some time to brainstorm and bikeshed on what would be best, so we can use it in all the various ML use cases coming up.
Q: Would it be possible to use the same event field data model for things like image-suggestions?

So I can think of a few types of models in terms of output types:

  • Classification models (topic, revert, quality, etc.) -- all of these are essentially some sort of class and associated [0-1] probability which seems well-supported.
  • Recommendation models (add-a-link; add-an-image) -- currently my understanding of these models is that they are run in batch and two types of data are produced: tags indicating if an article has a recommendation (that could easily be supported with this schema as a has-rec score with probability of 1) and then the specific recommendations themselves are stored in a Mediawiki table (example schema: T267329). These recs are a lot more complicated. The add-a-link example has a bunch of context fields (see below). Supporting these I assume would require re-introducing arbitrary maps? Or maybe recommendation outputs just would require a second schema (which makes a lot of sense to me because of how different they are from classification models).
* phrase_to_link (text)
* context_before (text -- 5 characters of text that occur before the phrase to link)
* context_after (text -- 5 characters of text that occur after the phrase to link)
* link_target (text)
* instance_occurrence (integer) -- number showing how many times the phrase to link appears in the wikitext before we arrive at the one to link
* probability (boolean)
* insertion_order (integer) -- order in which to insert the link on the page (e.g. recommendation "foo" [0] comes before recommendation "bar baz" [1], which comes before recommendation "bar" [2], etc)
    • A simpler subset of these recommendation models would be ones that require less context -- e.g., we have a model that generates potential descriptions to be added to Wikidata. For that, the output would just be a ranked list of text (e.g., 1: Article description, 2: Description of an article, 3: Another description). The current schema presumably could be hacked to make that work. Top-ranked description in the prediction and then each description is the key in the probability part (or maybe the value and the key is the rank?).
  • Embeddings -- we currently train embeddings as part of some of the classification models but it's very reasonable to think that at some point we might want to have a model that just outputs embeddings for articles etc. every time they're edited so other tools could make use of them without having to train their own. The output format for that then is an n-dimensional vector of floats. In the current schema, it wouldn't really have a prediction but the probability field could probably be repurposed to have the key be the vector index (0, 1, 2, ..., n) and the value be the embedding value for that index. For my own work, I use 50-dimensional embeddings for space reasons but it's not uncommon to see embeddings on the order of 1000 dimensions, especially for media like images.
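The repurposing described in the last bullet (a probability map keyed by vector index) could look like this in Python. This is purely illustrative of the comment's idea, not a proposed schema, and the 3-dimensional vector is a stand-in for a real 50- or 1000-dimensional one:

```python
# Illustrative only: packing an embedding vector into the existing
# `probability` map shape by using the vector index as the key.
embedding = [0.12, -0.54, 0.33]  # hypothetical 3-dim vector for brevity

packed = {str(i): v for i, v in enumerate(embedding)}

# Unpacking on the consumer side restores the ordered vector.
unpacked = [packed[str(i)] for i in range(len(packed))]
assert unpacked == embedding
```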

is score the best name for this field? Is that a generally used term for ML predictions?

I don't remember the logic behind score other than that's what Aaron had always used -- i.e. it used to be the Scoring Platform team before ML Platform. prediction is probably the more general term, but then that feels confusingly redundant with the nested prediction field. Maybe something like model_output as the top-level name to allow for different types of outputs? And then it seems we're using prediction to be the summary of the model outputs (which makes sense generally) and probability to be the full set of outputs with their associated confidence scores. In my comment above, though, I suggested a few ways to abuse the probability field in ways that don't really have probabilities (ranked results; embedding vectors), so if we go that direction, I'm wondering if something more generic like details is the only consistent umbrella term? If that feels too generic, then maybe it just makes sense to have three separate schemas (one for classification models, one for embeddings, one for recommendation models)?

maybe it just makes sense to have three separate schemas (one for classification models, one for embeddings, one for recommendation models)?

Based on everything you just wrote, I think this is the better way to go. Since we can do what we want here, we don't have to hammer the peg into the hole :)

For the use cases at hand (old revision score, outlink, and revert risk), what about 'predicted_classification'? Then we could do:

predicted_classification:
    model_name: outlink-topic-model
    model_version: 1.0.1
    classifications: # topic probability > 0.5
    - biography
    - women
    probabilities: # list all topics and probabilities
     biography: 0.99
     women: 0.9
     geography: 0.01

And for the revert-risk model:

predicted_classification:
    model_name: revertrisk
    model_version: 1.0.1
    classifications:
    - true
    probabilities:
     true: 0.9
     false: 0.1

This works for me.

For more general usage, I think we could add something about the parameters used for a given prediction (e.g. topic probability; prediction threshold), so it would be something like this:

score:
    model_name: revertrisk
    model_version: 1.0.1
    prediction:
    - true
    arguments:
    - threshold: 0.5
    probability:
     true: 0.9
     false: 0.1

However, if the arguments are going to be stable for all outputs, this information will be redundant, and I'd lean towards putting these details in the documentation.
I'm wondering if we also want to add information about individual explanations for these recommendations. Together with the ML team, we are exploring the "Explainer" function on Lift Wing (T330131), and maybe for models with this function activated, it would be good to add a link to the explanation for that exact recommendation:

score:
    model_name: revertrisk
    model_version: 1.0.1
    prediction:
    - true
    probability:
     true: 0.9
     false: 0.1
explainer:
  url: https://{liftwing}.explainer/model/params/rev_id

+1 having separate schemas for classification models, embeddings, and recommendation models.

For classification models, what about just using classification as top-level name and keeping the name of prediction and probabilities inside:

classification:
    model_name: outlink-topic-model
    model_version: 1.0.1
    prediction: # topic probability > 0.5
    - biography
    - women
    probabilities: # list all topics and probabilities
     biography: 0.99
     women: 0.9
     geography: 0.01

For more general usage, I think we could add something about the parameters used for a given prediction (e.g. topic probability; prediction threshold)

The parameters like threshold should be the same (or use a default value) for all outputs in the stream. All necessary parameters for the model are obtained from the source event (e.g., "rev_id", "page_title", and "lang" from the page_change event), but we won't get the threshold from the source event. Also, I'm inclined to put that information in the documentation.

I'm wondering if we also want to add information about individual explanations for this recommendations.

I don't see why we need to add the explainer url. Could you elaborate more? If you're referring to adding the explanation (the result from the explainer): since ChangeProp will call the predict endpoint of Lift Wing, not the explain endpoint, we won't have the explanation. Moreover, the explainer functionality is still in the exploratory phase on Lift Wing, so we're not yet sure what the output looks like for the explainer.

Moreover, the explainer functionality is still in the exploratory phase in Liftwing, so we're not yet sure what the output looks like for the explainer

Adding new fields later isn't too hard, so we can address this later.

For classification models, what about just using classification as top-level name

+1, but, do you think we would want some kind of common naming convention for 'ML prediction' fields? We don't need to bikeshed their schemas now, but what might we call the fields for representing embeddings and recommendations?

ChangeProp will call the predict endpoint

Is this the same endpoint that would be used for embeddings and recommendation models?

keeping the name of prediction and probabilities inside.

+1. Nit: let's call the field predictions since it is an array?

+1, but, do you think we would want some kind of common naming convention for 'ML prediction' fields?

Good point. Starting with predicted_ might be a good idea, so there are predicted_classification, predicted_embeddings and predicted_recommendations.

Is this the same endpoint that would be used for embeddings and recommendation models?

Yeah, all the models deployed on Lift Wing have the internal endpoint like https://inference.discovery.wmnet:30443/v1/models/{MODEL_NAME}:predict

+1. Nit: let's call the field predictions since it is an array?

Sure!

We should probably make a mediawiki/page/prediction_change schema that can and will be used by many streams (e.g. mediawiki.page_outlink_prediction_change, mediawiki.page_revert_risk_prediction_change).

I see, so it's like the mediawiki/revision/score schema can be used by many streams e.g. mediawiki.revision_score_$model. In this case, I think we don't need a field fragment schema for now.

Good point. Starting with predicted_ might be a good idea, so there are predicted_classification, predicted_embeddings and predicted_recommendations.

Makes sense to me -- the way I see it is the predicted_ fields could be the dependable/required fields for any downstream applications whereas a field like probabilities might be a bit less standardized and aimed more at research/debugging -- e.g., for a topic model with 64 classes, fine to include all probabilities it seems. If a model had 1000 classes though, maybe doesn't make so much sense to include them all.

it's like the mediawiki/revision/score schema can be used by many streams e.g. mediawiki.revision_score_$model. In this case, I think we don't need a field fragment schema for now.

Sounds good! Especially since we won't be using this predicted_classification field for other prediction kinds (embeddings, etc.).

And, it is easy to factor out the field as a fragment later if we need to, like if we want to make classification predictions on things other than mediawiki pages (users?).

So! The field we've discussed so far would look like this:

predicted_classification:
  model_name: example_model
  model_version: 1.0.1
  predictions:
  - yes
  - mostly
  probabilities:
    yes: 0.99
    mostly: 0.9
    hardly: 0.01

We'd put this schema at mediawiki/page/prediction_classification_change. This schema would $ref /fragment/mediawiki/state/change/page/1.0.0 just like mediawiki/page/change does, and add the new predicted_classification field.

Does that sound right?

It seems we can remove the map field since we are now building a stream per model.

Just to be totally sure, I want to revisit this question once more. For page content, there are actually multiple slots in the content_slots field, similar to how we had multiple scores in one event before.

Are we really sure we will never want to represent multiple classifications in the same page change event?

If a model had 1000 classes though, maybe doesn't make so much sense to include them all.

Yeah good point. When the event is generated on the Lift Wing side, we can choose to include only the first X classes. What do you think?
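Truncating to the top X classes on the Lift Wing side could be as simple as the following sketch (TOP_X is a hypothetical cutoff; the real value would presumably be per-model configuration):

```python
# Keep only the X highest-probability classes before emitting the event.
# TOP_X is a hypothetical cutoff, not an agreed-upon config name.
probabilities = {"biography": 0.99, "women": 0.9, "geography": 0.01}
TOP_X = 2

top = dict(sorted(probabilities.items(), key=lambda kv: -kv[1])[:TOP_X])
print(top)  # {'biography': 0.99, 'women': 0.9}
```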

Does that sound right?

Yes, it does!

Are we really sure we will never want to represent multiple classifications in the same page change event?

Currently, events generated by Lift Wing only have one classification from a single model server. If we switch to Flink or other approaches to process events in the future, we may be able to support multiple classifications in the same event. If this is the case, we can update the schema version accordingly at that time. What do you think?

If this is the case, we can update the schema version accordingly at that time.

Updating the schema later to do this will not be easy, as it would be an incompatible type change.

we may be able to support multiple classifications in the same event

I guess this is more of the question: Do we want to ever be able to do this? If the answer is 'probably not', then let's keep the single classification field as proposed.

I guess this is more of the question: Do we want to ever be able to do this? If the answer is 'probably not', then let's keep the single classification field as proposed.

I think there is a possibility we will want to represent multiple classifications in the same page change event in the future. Also, there is a possibility we may want to have both embeddings and classifications in the same event.

Updating the schema later to do this will not be easy, as it would be an incompatible type change.

BTW, recently created T332212: Major (API) versioning of Event Platform streams to decide a policy on how to do major version changes.

there is a possibility we will want to represent multiple classifications in the same page change event in the future. Also, there is a possibility we may want to have both embeddings and classifications in the same event.

Oh okay, if that is the case...we might want to keep the (slightly annoying) map field then?

Grr. Another rephrasing of the question is: what are the user use cases for having multiple classifications / embedding predictions in the same event stream? Do you expect users to want to be able to get multiple classifications for every change to do something useful? I wonder if @elukey and folks have thought about this, since they decided to separate the Lift Wing model endpoints.

what are the user use cases for having multiple classifications / embedding predictions in the same event stream?

I mentioned embeddings + classifications because embeddings usually serve as the features for classification models. I imagine it may be useful to have them in the same event stream. And for multiple classifications, maybe we'd like to have predictions from the different revert-risk model family (language agnostic, multilingual, ... ) in the same event stream. However, I think the research folks @diego @Isaac would be better suited to answer this question and make a decision. After all, the role of the ML team is to support the models and use cases from the research team.

I mentioned embeddings + classifications because embeddings usually serve as the features for classification models. I imagine it may be useful to have them in the same event stream. And for multiple classifications, maybe we'd like to have predictions from the different revert-risk model family (language agnostic, multilingual, ... ) in the same event stream. However, I think the research folks @diego @Isaac would be better suited to answer this question and make a decision. After all, the role of the ML team is to support the models and use cases from the research team.

I agree with all you said @achou.

I imagine it may be useful to have them in the same event stream.

We could def put them in the same event stream, as long as they share the same schema.

@diego the question is, should they be in the same event? Is there ever a reason a user would want to have many/all model predictions for a page in the same event?

@diego the question is, should they be in the same event? Is there ever a reason a user would want to have many/all model predictions for a page in the same event?

I would say that having them in the same stream is not a must, but it could be convenient, especially when we have two models for the same purpose but with different flavors, for example a language-agnostic model and multilingual ones for revert risk. Anyhow, they can also be merged "on the client side" later.

having them in the same stream

Just to be clear! 'same event' ≠ 'same stream'.

We are trying to figure out if the data model of a single event should have the capacity to represent multiple predictions in the same event. To do this, the predictions field has to be more complex (a map type), which is more flexible as to the final data in the event, but is worse for data discovery and more annoying for SQL querying.

Same stream (not same event) would mean that 2 events for the same page edit could have different predictions, but each event would only have one prediction.
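The querying difference is concrete: with a map field a consumer must iterate (or, in SQL, explode) the keys, while a single flat field is addressed directly. A Python sketch of both access patterns, with event shapes condensed from the examples in this thread:

```python
# Map-style event: one event can carry several model scores.
map_event = {"scores": {"model_a": {"prediction": ["yes"]},
                        "model_b": {"prediction": ["no"]}}}

# Flat-style event: exactly one prediction per event.
flat_event = {"predicted_classification": {"model_name": "model_a",
                                           "predictions": ["yes"]}}

# Map field: the consumer has to iterate over unknown keys.
all_preds = {name: s["prediction"] for name, s in map_event["scores"].items()}

# Flat field: direct, discoverable access by a fixed path.
pred = flat_event["predicted_classification"]["predictions"]
```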

Yes, I was thinking of the same event. Like:

scores:
      example_model_1:
        model_name: example_model_1
        model_version: 1.0.1
        prediction:
        - yes
        - mostly
        probability:
          yes: 0.99
          mostly: 0.9
          hardly: 0.01
      example_model_2:
        model_name: example_model_2
        model_version: 1.0.2
        prediction:
        - yes
        - mostly
        probability:
          yes: 0.80
          mostly: 0.9
          hardly: 0.01

Anyhow, they can also be merged "on the client side" later.

I think I would lean towards this. I like the simplicity of separate streams, and in Diego's example, I think it might be nice to not have the multilingual model (which, if I remember correctly, is higher latency) be a blocker for the language-agnostic prediction stream?

I mentioned embeddings + classifications because embeddings usually serve as the features for classification models.

I think if we get to a place where we're outputting embeddings as a stream, it's probably good to have it separate from the classifications -- i.e. it should be its own embeddings model with the goal of outputting an embedding and not explicitly tied to a particular classification model. If we want to output an embedding to provide context for a given prediction, however, that feels more like the explainability endpoint that you all have been working on.

I think I would lean towards this. I like the simplicity of separate streams, and in Diego's example, I think it might be nice to not have the multilingual model (which, if I remember correctly, is higher latency) be a blocker for the language-agnostic prediction stream?

True, and in general the multiple predictions per event approach would make the system less resilient to errors/delays.

We could def put them in the same event stream, as long as they share the same schema.

Understood! Thanks for explaining the difference between 'same event' and 'same stream' more clearly.

I think if we get to a place where we're outputting embeddings as a stream, it's probably good to have it separate from the classifications -- i.e. it should be its own embeddings model with the goal of outputting an embedding and not explicitly tied to a particular classification model. If we want to output an embedding to provide context for a given prediction, however, that feels more like the explainability endpoint that you all have been working on.

Agree.

From a system and practical perspective, I would prefer that a single event represents a single prediction, which is also currently supported by Lift Wing. I was just presenting possible use cases for everyone to discuss. :)

Okay, so it sounds like we are back to our preferred choice: one prediction per event. Thanks all.

Change 905965 had a related patch set uploaded (by AikoChou; author: AikoChou):

[schemas/event/primary@master] Add event schema for ML classification change on current page state

https://gerrit.wikimedia.org/r/905965

Nice! I'll add some comments there, but ask another question here for visibility.

Q: Will it be useful to have the 'prior state' of predicted_classifications in this event? And/or is it even possible to get that? I suppose that LiftWing handles requests by revision, so to do this we'd have to ask for the predicted_classification of both the parent rev_id and the current rev_id?

If the answer is yes, then we'll want to add a prior_state.predicted_classification field like this:

properties:
  # ...
  prior_state:
    type: object
    properties:
      predicted_classification:
        $ref: '#/properties/predicted_classification'

Q: Will it be useful to have the 'prior state' of predicted_classifications in this event?

This is very tempting but I don't personally have a super strong use-case for it and it feels reasonably expensive to get right. A few thoughts:

  • The best use-case I can think of for it is in being more kind when updating our Search indices -- e.g., for every revision, we compute the article topics and only if they're different from the previous topics do we send an update to the Search index. This would greatly reduce the updates to Search as most edits won't change an article substantially enough to change its topic. The tricky thing is that the topic model uses an article's links via the pagelinks table, so we don't currently have a way of getting a prediction for a past revision. For this to be feasible, I assume we'd need some cache of prior predictions? This is an extreme case but in general, it's not always a perfect assumption that the current model prediction for an old revision will be the same as the then-current model prediction for an old revision and that could cause issues depending on how we source the prior prediction.
  • For other use cases where we're just interested in triggering some behavior based on substantive changes to the article content as proxied by e.g., a large change in quality, my assumption is that we probably should instead focus on getting a stream enrichment that does edit types (diffs) and use that more directly. For example, if we want to flag when an article's quality decreases by a certain quantity, we're probably actually interested in edits that are removing certain types of content and we should just detect that directly with the edit types. The nice thing about the edit types library is that it would just be a direct enrichment and not a LiftWing call, so once there's a stream with previous+current wikitext in it, it's just a processing of those two strings with no additional API calls (or stream with current wikitext and we have the API call to get the previous wikitext).
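The Search use case in the first bullet amounts to a set comparison between prior and current predictions. A hedged sketch under that assumption (the function and its calling convention are illustrative, not an existing API):

```python
# Illustrative: emit a Search index update only when the predicted topic
# set actually changed between revisions. The function is hypothetical;
# field names follow the predicted_classification examples in this thread.
def should_update_search(prior_predictions, current_predictions):
    """Return True when the set of predicted topics changed."""
    return set(prior_predictions or []) != set(current_predictions)

# Order doesn't matter, only membership:
print(should_update_search(["biography", "women"], ["women", "biography"]))  # False
print(should_update_search(["biography"], ["biography", "women"]))           # True
```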

I would say that it would be nice to have it but not a must. As Isaac points out, it could be useful for some specific models, but not for all of them. So, depending on the complexity, I would like to have this as an optional field for some events.

Q: Will it be useful to have the 'prior state' of predicted_classifications in this event?

This is very tempting but I don't personally have a super strong use-case for it and it feels reasonably expensive to get right. A few thoughts:

  • The best use-case I can think of for it is in being more kind when updating our Search indices -- e.g., for every revision, we compute the article topics and only if they're different from the previous topics do we send an update to the Search index. This would greatly reduce the updates to Search as most edits won't change an article substantially enough to change its topic.

The search system already has similar optimizations built-in, it can retrieve the prior state from its own datastore.

Okay, thanks all. No prior_state for now then. We can always add later if we decide to.

Change 907923 had a related patch set uploaded (by AikoChou; author: AikoChou):

[machinelearning/liftwing/inference-services@main] WIP - events: add code to generate predicted_classification events

https://gerrit.wikimedia.org/r/907923

Change 905965 merged by jenkins-bot:

[schemas/event/primary@master] Add event schema for ML classification change on current page state

https://gerrit.wikimedia.org/r/905965

Change 907923 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] events: add code to generate predicted_classification events

https://gerrit.wikimedia.org/r/907923

Change 914768 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate - bump image version to pick up new schemas

https://gerrit.wikimedia.org/r/914768

Change 914768 merged by Ottomata:

[operations/deployment-charts@master] eventgate - bump image version to pick up new schemas

https://gerrit.wikimedia.org/r/914768

Change 914769 had a related patch set uploaded (by Ottomata; author: Ottomata):

[operations/deployment-charts@master] eventgate-main - bump image version to pick up new schemas

https://gerrit.wikimedia.org/r/914769

Change 914769 merged by Ottomata:

[operations/deployment-charts@master] eventgate-main - bump image version to pick up new schemas

https://gerrit.wikimedia.org/r/914769