Create event stream for article-country model-server hosted on LiftWing
Closed, Resolved (Public)

Description

In T371897, we deployed the article-country model-server on LiftWing. In this task, we are going to incorporate the article-country predictions into the Search index using a stream. Below are the key requirements and steps to accomplish this:

  • Identify the source event stream for this model
  • Decide whether to filter traffic from the stream
  • Define the schema for events generated by the model-server
  • Configure and deploy the new event stream to include events generated by the model-server

Event Timeline

Adding @EBernhardson as well because when this stream moves forward, we'll want to ingest it into Search (past conversations about this were in T301671). Presumably the stream will carry a model prediction for every Wikipedia edit: a list of 0-250 countries, each with a score between 0.0 and 1.0. Most outputs will be a single country with a score of 1.0, e.g., for a person who was born in one country and was never strongly associated with any other place.

For more details:

@Isaac Thanks; we expected it to be ready in Q3 onwards based on your earlier assessment of the next hypothesis steps.

Update: we are starting this work next week so we'll be providing updates on this task.

@isarantopoulos great news and many thanks for finding the space! Just let me know where I can help.

Change #1111565 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: process inputs from source event stream

https://gerrit.wikimedia.org/r/1111565

Change #1111565 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: process inputs from source event stream

https://gerrit.wikimedia.org/r/1111565

Change #1111917 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: send prediction results to output event stream

https://gerrit.wikimedia.org/r/1111917

Change #1111917 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: send prediction results to output event stream

https://gerrit.wikimedia.org/r/1111917

Change #1112126 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] changeprop: add liftwing article-country stream to staging

https://gerrit.wikimedia.org/r/1112126

Change #1112127 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: match event model name with isvc host header

https://gerrit.wikimedia.org/r/1112127

Change #1112127 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: match event model name with isvc host header

https://gerrit.wikimedia.org/r/1112127

Change #1112449 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-country deployment image

https://gerrit.wikimedia.org/r/1112449

Change #1112451 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/mediawiki-config@master] EventStreamConfig: Add mediawiki.page_article_country_prediction_change stream

https://gerrit.wikimedia.org/r/1112451

Change #1112449 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-country deployment image

https://gerrit.wikimedia.org/r/1112449

Change #1112126 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: add liftwing article-country stream to staging

https://gerrit.wikimedia.org/r/1112126

isarantopoulos updated the task description.

After a discussion we had with the Search team on the topic, our team will need to do the following:

  • publish the messages to the mediawiki.cirrussearch.page_weighted_tags_change.rc0 stream according to the docs.
  • According to the Stream and the schema definition both the page_id and the page_title are required. This will require a modification to the model server as it currently processes the page_title while making a request. One option would be the ability to make a request either using a page_title or a page_id and use the latter in this use case. Alternatively we'd have to get the page_id within the model server by querying the mediawiki api.

@dcausse @kevinbazira please add any more information or correct the above. Thanks!

Thanks all for working this out! I know a lot of moving parts here so I appreciate the work to figure out the best approach and who owns what piece. Just to make sure I understand (for this project and future streams):

  • This mediawiki.cirrussearch.page_weighted_tags_change.rc0 stream is now the interface point between what LiftWing outputs and what goes into Search. Beyond matching that standard schema around page ID/title, that also means we need to define the tag prefix now. There's already a fair bit of code written on CirrusSearch for handling articletopic-related inputs so presumably we want to build on that because article-country is closely related. The existing prefixes are classification.ores.articletopic (this model) and classification.ores.drafttopic (this model). I would suggest not using either of those because we may want the ability to e.g., flush out one set of predictions due to a model update/deprecation without affecting the others. It looks like Kevin had been going with liftwing.test-article-country-events on staging and if I merge that with the existing Search tag norms, it sounds like classification.liftwing.articlecountry maybe is a good choice?
  • Once this stream is live, there will still be the question of how to actually query the tags. For that, I presume we'll still need a bit of helper code added to CirrusSearch in ArticleTopicFeature.php? Will @dcausse be handling this (with input from the LPL team as the Product stakeholder) or do we still need to find someone to be responsible for this change? Presumably this means:
    • Adding a new variable ARTICLE_COUNTRY_TAG_PREFIX and pointing to the new classification.liftwing.articlecountry prefix.
    • Updating PREFIX_PER_KEYWORD variable to assign it to a new search keyword. I had originally thought that merging in the country labels with the existing articletopic labels might make sense, but I see now that that could get messy with these generic streams so I presume we'll want to go with articlecountry for the new search keyword?
    • Updating the new TERMS_TO_LABELS mapping so model outputs like Bonaire, Sint Eustatius, and Saba don't need quoted and are perhaps easier to guess. I left some thoughts in T301671#10468557 about some considerations for this.

classification.liftwing.articlecountry maybe is a good choice?

Suggestion, do not put backend platform names (like LiftWing) in data.

LiftWing is an implementation detail, no? It is the name we use to refer to our infrastructure that serves ML models as HTTP APIs.

Maybe something like classification.prediction.articlecountry?

Suggestion also maybe to version this in case you want to have an easy way to migrate to serving a new version, while still keeping the old ES prefix around? Then you could vary (a/b test?) on which does better in search results / the feature?

classification.prediction.articlecountry.v1?

After a discussion we had with the Search team on the topic

I'm going to reflect a point I made in this discussion.

+1 to producing to mediawiki.cirrussearch.page_weighted_tags_change.rc0 to update the search index.

However, cirrussearch.page_weighted_tags_change is a command to update the MediaWiki search index. This means that unlike mediawiki.page_outlink_topic_prediction_change.v1, the articlecountry prediction will not be available for re-use in an event stream or Hive dataset.

  • It will be difficult to trace the changes of predictions made for an article over time.
  • It will be difficult to join this information with other datasets (including ones created by Metrics/Experimentation Platform for product metrics).
  • It will be difficult to evaluate the model's performance over time.
  • It will also be difficult to expose as a dataset / stream for volunteer developers or WMF Enterprise at https://stream.wikimedia.org/v2/ui/#/.

cirrussearch.page_weighted_tags_change is not a data product, it is a command to update a specific datastore.

I suggested that both the mediawiki.article_country_prediction_change.v1 and the mediawiki.cirrussearch.page_weighted_tags_change.rc0 stream be produced to.

The counter argument is that we don't know that users want to use a mediawiki.article_country_prediction_change dataset.

I appreciate this sentiment, but from experience, what happens is that users DO want this kind of data, and if teams that own data do not think about those use cases (explicitly requested, or potential) while they are first creating the data, they will never do so. We are under-resourced, and once the ML team moves on to the next project, it will be difficult to justify resourcing this again. When a user does come to ask for this data, their request will be put on the backlog for years, if ever completed.

This leads to a culture of brittle DIY projects and data pipelines, as data users cannot rely on feature supporting teams to treat their data as a product. See also Elephant 4. Problem 1, which was discussed at November 2024's Data Strategy Convening.

Generally, we are trying to shift the culture at WMF to think about more than just UI product feature end users as the users of data. See also Data as a Product.

If the work to do this was significant, I don't think I would be writing such a long message advocating for this. But, since we already spent a significant amount of time enabling the creation of MediaWiki page ML prediction data products, the hard work is already done!

FWIW, this is 100% a product decision, but I ask that product decisions like this be made with the holistic organization wide view of the potential power of reusable data products, rather than only for the immediate product feature need.

@Isaac @mpopov, I'm curious to see what you think here. (I'm trying to avoid making more data gaps ;) )

cc also @SSalgaonkar-WMF

  • According to the Stream and the schema definition both the page_id and the page_title are required. This will require a modification to the model server as it currently processes the page_title while making a request. One option would be the ability to make a request either using a page_title or a page_id and use the latter in this use case. Alternatively we'd have to get the page_id within the model server by querying the mediawiki api.

page_title and page_id should already be part of the mediawiki.page_change.v1 events. IIRC the outlink model seems to have access to the whole event; could the same technique be used here to avoid fetching something additional via the MediaWiki API?
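
A minimal sketch of that idea, assuming the model server is handed the full mediawiki.page_change.v1 event as a parsed dict. The helper name page_identity is hypothetical and is not taken from the inference-services code; the field names are those that appear in page_change events.

```python
def page_identity(event: dict) -> dict:
    """Collect the page fields needed downstream from a page_change event.

    Hypothetical helper for illustration only: it reads the fields from the
    event itself instead of calling the MediaWiki API.
    """
    page = event["page"]
    return {
        "wiki_id": event["wiki_id"],
        "page_id": page["page_id"],
        "page_title": page["page_title"],
        "namespace_id": page["namespace_id"],
    }


# Trimmed-down event containing only the fields used above.
sample_event = {
    "wiki_id": "enwiki",
    "page": {
        "page_id": 32167136,
        "page_title": "Eskimo_potato",
        "namespace_id": 0,
        "is_redirect": False,
    },
}

identity = page_identity(sample_event)
print(identity)
```

With this approach both page_id and page_title come from the source event, so no extra lookup is needed to satisfy the weighted tags schema.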

Thanks all for working this out! I know a lot of moving parts here so I appreciate the work to figure out the best approach and who owns what piece. Just to make sure I understand (for this project and future streams):

  • This mediawiki.cirrussearch.page_weighted_tags_change.rc0 stream is now the interface point between what LiftWing outputs and what goes into Search. Beyond matching that standard schema around page ID/title, that also means we need to define the tag prefix now. There's already a fair bit of code written on CirrusSearch for handling articletopic-related inputs so presumably we want to build on that because article-country is closely related. The existing prefixes are classification.ores.articletopic (this model) and classification.ores.drafttopic (this model). I would suggest not using either of those because we may want the ability to e.g., flush out one set of predictions due to a model update/deprecation without affecting the others. It looks like Kevin had been going with liftwing.test-article-country-events on staging and if I merge that with the existing Search tag norms, it sounds like classification.liftwing.articlecountry maybe is a good choice?

Correct, I vote for classification.prediction.articlecountry as suggested by Andrew. (Note that we don't yet have a great story for handling migrations, so I'd suggest not adding the version suffix just yet; it can be added later, on the first migration, if we believe it's helpful.)

  • Once this stream is live, there will still be the question of how to actually query the tags. For that, I presume we'll still need a bit of helper code added to CirrusSearch in ArticleTopicFeature.php? Will @dcausse be handling this (with input from the LPL team as the Product stakeholder) or do we still need to find someone to be responsible for this change? Presumably this means:
    • Adding a new variable ARTICLE_COUNTRY_TAG_PREFIX and pointing to the new classification.liftwing.articlecountry prefix.
    • Updating PREFIX_PER_KEYWORD variable to assign it to a new search keyword. I had originally thought that merging in the country labels with the existing articletopic labels might make sense, but I see now that that could get messy with these generic streams so I presume we'll want to go with articlecountry for the new search keyword?
    • Updating the new TERMS_TO_LABELS mapping so model outputs like Bonaire, Sint Eustatius, and Saba don't need quoted and are perhaps easier to guess. I left some thoughts in T301671#10468557 about some considerations for this.

The search platform can take this part. A few things to consider/keep in mind:

  • topic names did not contain any spaces but country names might: "United States", the search keyword will have to support these by wrapping them in quotes: articlecountry:"United States"
    • update: some topic names did include spaces so this is not something new
  • (probably too late but) why not changing the predictions to 3 letters ISO3166 instead of country names?

Regarding TERMS_TO_LABELS I agree that it will require some updates to make the keyword a bit more user-friendly but we would certainly need help to decide what's appropriate here.
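
To illustrate the quoting point above, here is a small sketch of how a client might build such a query string. The articlecountry keyword was only a proposal at this point in the discussion, and the helper name keyword_query is hypothetical; this is not CirrusSearch code.

```python
def keyword_query(keyword: str, value: str) -> str:
    """Build a keyword:value search term, quoting values that contain whitespace."""
    if any(ch.isspace() for ch in value):
        return f'{keyword}:"{value}"'
    return f"{keyword}:{value}"


print(keyword_query("articlecountry", "United States"))
print(keyword_query("articlecountry", "Canada"))
```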

Change #1114355 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] events: add support for the weighted tags event stream

https://gerrit.wikimedia.org/r/1114355

Change #1114600 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: send prediction results to weighted tags stream

https://gerrit.wikimedia.org/r/1114600

classification.prediction.articlecountry

Works for me!

topic names did not contain any spaces but country names might: "United States", the search keyword will have to support these by wrapping them in quotes: articlecountry:"United States"
(probably too late but) why not changing the predictions to 3 letters ISO3166 instead of country names?

@dcausse I'm going to copy these questions over to T301671 because that's probably a better place for this discussion as I don't think it'll affect the stream that Kevin is currently creating. The quick answer for why we're not using 3-letter ISO codes (which would solve the complex syntax issue) is:

  • The model was developed to be directly consumed by end-users who won't usually know the ISO codes (some are obvious like United States -> USA but many do not make sense at least for English speakers such as Algeria -> DZA or United Arab Emirates -> ARE). I think those codes might be a reasonable way to encode in Search if we'd like but that's mainly because I suspect most usage of this feature will be mediated through tools like Content Translation (as opposed to direct usage by readers/editors).
  • It was also developed to interface nicely with our other geographic datasets like Geoeditors which just use the full country names.
  • That said, it's a valid point that we could have included the country codes as part of the response even if they aren't standard in our geographic datasets. Even if we had direct access to those ISO codes in the stream, I think we still might have wanted to hard-code the list of acceptable countries though in the Search code because that's also used to trigger a message that points to the documentation when someone searches for an invalid key.

isarantopoulos raised the priority of this task from Medium to High. Jan 28 2025, 2:38 PM

Change #1114355 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] events: add support for the weighted tags event stream

https://gerrit.wikimedia.org/r/1114355

Change #1114600 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: send prediction results to weighted tags stream

https://gerrit.wikimedia.org/r/1114600

Change #1115160 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: update naming for prediction classification change stream

https://gerrit.wikimedia.org/r/1115160

Change #1115160 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: update naming for prediction classification change stream

https://gerrit.wikimedia.org/r/1115160

Change #1115359 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-country staging config

https://gerrit.wikimedia.org/r/1115359

Change #1115359 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-country staging config

https://gerrit.wikimedia.org/r/1115359

I suggested that both the mediawiki.article_country_prediction_change.v1 and the mediawiki.cirrussearch.page_weighted_tags_change.rc0 stream be produced to.

@Ottomata We discussed this again last week after our meeting and we're going to produce it to both streams for all the reasons you mentioned above.
We're focusing on making this available to the weighted tags stream first and then we'll produce it to prediction_change as well.

  • (probably too late but) why not changing the predictions to 3 letters ISO3166 instead of country names?

@dcausse I do understand that using an ISO code makes more sense. Is this a requirement from your side? While we don't have the iso codes available in the service at the moment if it is required we can add them and add an additional field country_code or country_iso_code perhaps. Naming suggestions are welcome.

Thanks everyone for the suggestions. An article-country model-server that supports both event streams has been deployed on LiftWing staging.

We have tested both streams and below are the results:

1. Produce sample event in article-country Kafka topic

On stat1008, published a sample mediawiki.page_change.v1 event obtained from https://stream.wikimedia.org/v2/ui/#/?streams=mediawiki.page_change.v1:

$ cat sample.mediawiki.page_change.v1.event.json | kafkacat -P -b kafka-main1006.eqiad.wmnet:9093 -t staging.liftwing.test-article-country-events -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt

and confirmed event has been published:

$ kafkacat -C -b kafka-main1006.eqiad.wmnet:9093 -t staging.liftwing.test-article-country-events -o -1 -e -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt
{"changelog_kind":"update","page_change_kind":"edit","dt":"2025-01-31T06:26:17Z","wiki_id":"enwiki","page":{"page_id":32167136,"page_title":"Eskimo_potato","namespace_id":0,"is_redirect":false},"performer":{"user_text":"2A00:23C5:FE1C:3701:11AB:84E:417D:3290","groups":["*"],"is_bot":false,"is_system":false,"is_temp":false},"revision":{"rev_id":1273000193,"rev_dt":"2025-01-31T06:26:17Z","is_minor_edit":false,"rev_sha1":"1s3alxhnyqpo703zca2juoofdrs5a3u","rev_size":2450,"rev_parent_id":1252301160,"comment":"\"attributed as\" sounds weird -- you usu. attribute sth TO someone, or to a cause","editor":{"user_text":"2A00:23C5:FE1C:3701:11AB:84E:417D:3290","groups":["*"],"is_bot":false,"is_system":false,"is_temp":false},"is_content_visible":true,"is_editor_visible":true,"is_comment_visible":true,"content_slots":{"main":{"slot_role":"main","content_model":"wikitext","content_sha1":"1s3alxhnyqpo703zca2juoofdrs5a3u","content_size":2450,"content_format":"text/x-wiki","origin_rev_id":1273000193}}},"prior_state":{"revision":{"rev_id":1252301160,"rev_dt":"2024-10-20T19:00:26Z","is_minor_edit":false,"rev_sha1":"ohv6914lvvvn50oppcs1i2jgafqrm7m","rev_size":2455,"rev_parent_id":1213874781,"comment":"Formal grammar touched up; “is used in modern times” leaves open the question of what creature uses it. What is more important, this entry was carelessly worded, for describing “[b]oth species” of two different orders is incompatible with describing “a type of edible plant.” Is the Eskimo potato one plant or two? 
I have altered this entry to treat it as just one, but perhaps another WP editor can clarify the matter.","editor":{"user_text":"Mucketymuck","groups":["extendedconfirmed","*","user","autoconfirmed"],"is_bot":false,"is_system":false,"is_temp":false,"user_id":33582079,"registration_dt":"2018-04-21T18:00:14Z","edit_count":5513},"is_content_visible":true,"is_editor_visible":true,"is_comment_visible":true,"content_slots":{"main":{"slot_role":"main","content_model":"wikitext","content_sha1":"ohv6914lvvvn50oppcs1i2jgafqrm7m","content_size":2455,"content_format":"text/x-wiki","origin_rev_id":1252301160}}}},"$schema":"/mediawiki/page/change/1.2.0","meta":{"stream":"mediawiki.page_change.v1","uri":"https://en.wikipedia.org/wiki/Eskimo_potato","id":"d6d7f1d3-62f3-461d-bb96-e38a56de4625","request_id":"b22c3729-a963-465c-b9aa-5a560f8e907d","domain":"en.wikipedia.org","dt":"2025-01-31T06:26:17Z","topic":"codfw.mediawiki.page_change.v1","partition":0,"offset":485306037,"key":{"type":"Buffer","data":[123,34,119,105,107,105,95,105,100,34,58,34,101,110,119,105,107,105,34,44,34,112,97,103,101,95,105,100,34,58,51,50,49,54,55,49,51,54,125]}}}
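
As a side note on the event above: its meta.key field is a serialized byte buffer. The sketch below (not part of the original test) decodes those bytes in Python; the byte list is copied from the event, and reading it as the Kafka message key is an assumption based on the neighbouring topic, partition and offset fields.

```python
import json

# Byte values copied from meta.key.data in the sample event.
key_bytes = [
    123, 34, 119, 105, 107, 105, 95, 105, 100, 34, 58, 34, 101, 110, 119,
    105, 107, 105, 34, 44, 34, 112, 97, 103, 101, 95, 105, 100, 34, 58,
    51, 50, 49, 54, 55, 49, 51, 54, 125,
]

# The buffer holds UTF-8 encoded JSON.
key = json.loads(bytes(key_bytes).decode("utf-8"))
print(key)  # {'wiki_id': 'enwiki', 'page_id': 32167136}
```

The decoded page_id matches the page.page_id field of the same event.
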

2. Receive event in article-country model-server

On deploy2002, checked whether the request with the event has been received by the article-country inference service hosted on LiftWing:

$ kube_env article-models ml-staging-codfw
$ kubectl get pods
$ kubectl logs article-country-predictor-00004-deployment-6c676b99d8-lgzgb

3. Produce both prediction change and weighted tags events

On stat1008, confirmed both article-country events have been published by the model-server:

3.1. mediawiki.page_prediction_change.rc0
$ kafkacat -C -b kafka-main1006.eqiad.wmnet:9093 -t codfw.mediawiki.page_prediction_change.rc0 -o -1 -e -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt | jq


{
  "changelog_kind": "update",
  "page_change_kind": "edit",
  "dt": "2025-01-31T06:26:17Z",
  "wiki_id": "enwiki",
  "page": {
    "page_id": 32167136,
    "page_title": "Eskimo_potato",
    "namespace_id": 0,
    "is_redirect": false
  },
  "performer": {
    "user_text": "2A00:23C5:FE1C:3701:11AB:84E:417D:3290",
    "groups": [
      "*"
    ],
    "is_bot": false,
    "is_system": false,
    "is_temp": false
  },
  "revision": {
    "rev_id": 1273000193,
    "rev_dt": "2025-01-31T06:26:17Z",
    "is_minor_edit": false,
    "rev_sha1": "1s3alxhnyqpo703zca2juoofdrs5a3u",
    "rev_size": 2450,
    "rev_parent_id": 1252301160,
    "comment": "\"attributed as\" sounds weird -- you usu. attribute sth TO someone, or to a cause",
    "editor": {
      "user_text": "2A00:23C5:FE1C:3701:11AB:84E:417D:3290",
      "groups": [
        "*"
      ],
      "is_bot": false,
      "is_system": false,
      "is_temp": false
    },
    "is_content_visible": true,
    "is_editor_visible": true,
    "is_comment_visible": true
  },
  "prior_state": {
    "revision": {
      "rev_id": 1252301160,
      "rev_dt": "2024-10-20T19:00:26Z",
      "is_minor_edit": false,
      "rev_sha1": "ohv6914lvvvn50oppcs1i2jgafqrm7m",
      "rev_size": 2455,
      "rev_parent_id": 1213874781,
      "comment": "Formal grammar touched up; “is used in modern times” leaves open the question of what creature uses it. What is more important, this entry was carelessly worded, for describing “[b]oth species” of two different orders is incompatible with describing “a type of edible plant.” Is the Eskimo potato one plant or two? I have altered this entry to treat it as just one, but perhaps another WP editor can clarify the matter.",
      "editor": {
        "user_text": "Mucketymuck",
        "groups": [
          "extendedconfirmed",
          "*",
          "user",
          "autoconfirmed"
        ],
        "is_bot": false,
        "is_system": false,
        "is_temp": false,
        "user_id": 33582079,
        "registration_dt": "2018-04-21T18:00:14Z",
        "edit_count": 5513
      },
      "is_content_visible": true,
      "is_editor_visible": true,
      "is_comment_visible": true
    }
  },
  "$schema": "mediawiki/page/prediction_classification_change/1.1.0",
  "meta": {
    "stream": "mediawiki.page_prediction_change.rc0",
    "id": "4518d06c-e9fe-4a87-af1e-4afb4d9b01ea",
    "request_id": "b22c3729-a963-465c-b9aa-5a560f8e907d",
    "domain": "en.wikipedia.org",
    "uri": "https://en.wikipedia.org/wiki/Eskimo_potato",
    "dt": "2025-02-03T09:43:56.724Z"
  },
  "predicted_classification": {
    "model_name": "article-country",
    "model_version": "1",
    "predictions": [
      "Canada"
    ],
    "probabilities": {
      "Canada": 1
    }
  }
}

3.2. mediawiki.cirrussearch.page_weighted_tags_change.rc0
$ kafkacat -C -b kafka-main1006.eqiad.wmnet:9093 -t codfw.mediawiki.cirrussearch.page_weighted_tags_change.rc0 -o -500000 -e -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt | jq 'select(. | tostring | test("classification.prediction.articlecountry"))'
{
  "$schema": "/development/cirrussearch/page_weighted_tags_change/1.0.0",
  "dt": "2025-01-31T06:26:17Z",
  "meta": {
    "stream": "mediawiki.cirrussearch.page_weighted_tags_change.rc0",
    "id": "a1617355-a2f3-42eb-9bda-45b66b155a46",
    "request_id": "b22c3729-a963-465c-b9aa-5a560f8e907d",
    "domain": "en.wikipedia.org",
    "uri": "https://en.wikipedia.org/wiki/Eskimo_potato",
    "dt": "2025-02-03T09:43:56.781Z"
  },
  "page": {
    "namespace_id": 0,
    "page_id": 32167136,
    "page_title": "Eskimo_potato"
  },
  "weighted_tags": {
    "set": {
      "classification.prediction.articlecountry": [
        {
          "tag": "Canada",
          "score": 1
        }
      ]
    }
  },
  "wiki_id": "enwiki",
  "rev_based": true
}
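
The relationship between the two sample events above can be sketched as follows. This is an illustration only, not the model server's actual code: the function name to_weighted_tags is hypothetical, and the mapping simply mirrors what the two samples show (each predicted country and its probability becomes a tag/score pair under the classification.prediction.articlecountry prefix).

```python
TAG_PREFIX = "classification.prediction.articlecountry"


def to_weighted_tags(predicted_classification: dict, prefix: str = TAG_PREFIX) -> dict:
    """Turn a predicted_classification block into a weighted_tags "set" block."""
    probabilities = predicted_classification["probabilities"]
    tags = [
        {"tag": country, "score": probabilities[country]}
        for country in predicted_classification["predictions"]
    ]
    return {"set": {prefix: tags}}


# predicted_classification copied from the page_prediction_change sample.
prediction = {
    "model_name": "article-country",
    "model_version": "1",
    "predictions": ["Canada"],
    "probabilities": {"Canada": 1},
}

weighted_tags = to_weighted_tags(prediction)
print(weighted_tags)
```

For the sample prediction this yields the same weighted_tags block as the event in 3.2.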

@Ottomata and @dcausse, please let us know whether we should proceed to production with these events.

For production, we will use mediawiki.article_country_prediction_change.v1 instead of mediawiki.page_prediction_change.rc0, while mediawiki.cirrussearch.page_weighted_tags_change.rc0 will remain unchanged.

@Ottomata We discussed this again last week after our meeting and we're going to produce it to both streams for all the reasons you mentioned above.

<3 <3 <3

@Ottomata and @dcausse, please let us know whether we should proceed to production with these events.

LGTM thank you!

production, we will use mediawiki.article_country_prediction_change.v1 instead of mediawiki.page_prediction_change.rc0

Okay! Based on the discussions in the parent task, 'article' seems fine. When you add this to production stream config, could you please add a comment stating that the existing page outlink topic model prediction stream should probably be renamed to 'article', and link to the parent task discussion in the comments? This will help the next person be less confused.

Thank you!

Change #1117063 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] changeprop: add liftwing article-country source stream to prod

https://gerrit.wikimedia.org/r/1117063

@Ottomata and @dcausse, please let us know whether we should proceed to production with these events.

For production, we will use mediawiki.article_country_prediction_change.v1 instead of mediawiki.page_prediction_change.rc0, while mediawiki.cirrussearch.page_weighted_tags_change.rc0 will remain unchanged.

I confirm that your test on staging worked properly: I see the prediction in the search index for the page you tested (classification.prediction.articlecountry/Canada|1000), see https://en.wikipedia.org/wiki/Eskimo_potato?action=cirrusDump
Once deployed in production we should see the rate of weighted tags going up at https://grafana-rw.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1
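
For readers checking the cirrusDump output, the indexed value above can be split into its parts with a few lines of Python. This is a sketch that assumes the prefix/tag|weight layout seen in that one value; the helper name parse_index_tag is hypothetical. That the weight 1000 corresponds to the score 1 from the event is inferred from this single example, not from documentation.

```python
def parse_index_tag(raw: str) -> tuple[str, str, int]:
    """Split 'prefix/tag|weight' into (prefix, tag, weight)."""
    prefix_and_tag, _, weight = raw.rpartition("|")
    prefix, _, tag = prefix_and_tag.partition("/")
    return prefix, tag, int(weight)


parsed = parse_index_tag("classification.prediction.articlecountry/Canada|1000")
print(parsed)  # ('classification.prediction.articlecountry', 'Canada', 1000)
```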

  • (probably too late but) why not changing the predictions to 3 letters ISO3166 instead of country names?

@dcausse I do understand that using an ISO code makes more sense. Is this a requirement from your side? While we don't have the iso codes available in the service at the moment if it is required we can add them and add an additional field country_code or country_iso_code perhaps. Naming suggestions are welcome.

Unsure if this is necessary at this point, but happy to revisit this whenever you want. If we can live with some mapping on the CirrusSearch side, I'm fine with it.

Change #1117063 merged by jenkins-bot:

[operations/deployment-charts@master] changeprop: add liftwing article-country source stream to prod

https://gerrit.wikimedia.org/r/1117063

Change #1112451 merged by jenkins-bot:

[operations/mediawiki-config@master] EventStreamConfig: Add mediawiki.article_country_prediction_change stream

https://gerrit.wikimedia.org/r/1112451

Mentioned in SAL (#wikimedia-operations) [2025-02-04T14:38:31Z] <lucaswerkmeister-wmde@deploy2002> Started scap sync-world: Backport for [[gerrit:1112451|EventStreamConfig: Add mediawiki.article_country_prediction_change stream (T382295)]]

Mentioned in SAL (#wikimedia-operations) [2025-02-04T14:43:34Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde, kevinbazira: Backport for [[gerrit:1112451|EventStreamConfig: Add mediawiki.article_country_prediction_change stream (T382295)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-02-04T14:54:54Z] <lucaswerkmeister-wmde@deploy2002> Finished scap sync-world: Backport for [[gerrit:1112451|EventStreamConfig: Add mediawiki.article_country_prediction_change stream (T382295)]] (duration: 16m 23s)

Change #1117318 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-country prod config

https://gerrit.wikimedia.org/r/1117318

Change #1117318 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-country prod config

https://gerrit.wikimedia.org/r/1117318

Thanks @Ottomata and @dcausse for the confirmation. The article-country model-server that supports both streams has been deployed in LiftWing production.

The model-server receives a mediawiki.page_change.v1 event, preprocesses it, produces a prediction, and publishes it in both streams as shown below:

1. mediawiki.article_country_prediction_change.v1
$ kafkacat -C -b kafka-main1006.eqiad.wmnet:9093 -t codfw.mediawiki.article_country_prediction_change.v1 -o -1 -e -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt | jq 'select(. | tostring | test("mediawiki.article_country_prediction_change.v1"))'

{
  "changelog_kind": "update",
  "page_change_kind": "edit",
  "dt": "2025-02-06T06:46:48Z",
  "wiki_id": "frwiki",
  "page": {
    "page_id": 532514,
    "page_title": "Saint-Léger-aux-Bois_(Oise)",
    "namespace_id": 0,
    "is_redirect": false
  },
  "performer": {
    "user_text": "SyntaxTerrorBot",
    "groups": [
      "bot",
      "*",
      "user",
      "autoconfirmed",
      "autopatrolled"
    ],
    "is_bot": true,
    "is_system": false,
    "is_temp": false,
    "user_id": 2357828,
    "registration_dt": "2015-10-20T07:57:54Z",
    "edit_count": 220045
  },
  "revision": {
    "rev_id": 222752395,
    "rev_dt": "2025-02-06T06:46:48Z",
    "is_minor_edit": true,
    "rev_sha1": "odiop6qy1tpzhh5s1n2oageri7o3cnj",
    "rev_size": 47265,
    "rev_parent_id": 220339143,
    "comment": "retrait {{sommaire|niveau=2}} (voir [[Wikipédia:Bot/Requêtes/2025/02#Retrait d'un modèle de sommaire]]) + corrections mineures",
    "editor": {
      "user_text": "SyntaxTerrorBot",
      "groups": [
        "bot",
        "*",
        "user",
        "autoconfirmed",
        "autopatrolled"
      ],
      "is_bot": true,
      "is_system": false,
      "is_temp": false,
      "user_id": 2357828,
      "registration_dt": "2015-10-20T07:57:54Z",
      "edit_count": 220045
    },
    "is_content_visible": true,
    "is_editor_visible": true,
    "is_comment_visible": true
  },
  "prior_state": {
    "revision": {
      "rev_id": 220339143,
      "rev_dt": "2024-11-16T12:41:23Z",
      "is_minor_edit": true,
      "rev_sha1": "lrsl75uelxr3qimai8dg158waq4a0up",
      "rev_size": 47293,
      "rev_parent_id": 220338684,
      "comment": "/* Introduction */ wikif ([[m:User:Jon Harald Søby/diffedit|diffedit]])",
      "editor": {
        "user_text": "Csar62",
        "groups": [
          "rollbacker",
          "*",
          "user",
          "autoconfirmed",
          "autopatrolled"
        ],
        "is_bot": false,
        "is_system": false,
        "is_temp": false,
        "user_id": 2786542,
        "registration_dt": "2017-04-19T07:57:24Z",
        "edit_count": 117603
      },
      "is_content_visible": true,
      "is_editor_visible": true,
      "is_comment_visible": true
    }
  },
  "$schema": "mediawiki/page/prediction_classification_change/1.1.0",
  "meta": {
    "stream": "mediawiki.article_country_prediction_change.v1",
    "id": "b6d2ac6f-a41b-4c32-b238-1e01a51eb83c",
    "request_id": "6a0aabaa-f341-4328-8047-6506eaa405e8",
    "domain": "fr.wikipedia.org",
    "uri": "https://fr.wikipedia.org/wiki/Saint-L%C3%A9ger-aux-Bois_(Oise)",
    "dt": "2025-02-06T06:46:53.139Z"
  },
  "predicted_classification": {
    "model_name": "article-country",
    "model_version": "1",
    "predictions": [
      "France"
    ],
    "probabilities": {
      "France": 1
    }
  }
}

2. mediawiki.cirrussearch.page_weighted_tags_change.rc0
$ kafkacat -C -b kafka-main1006.eqiad.wmnet:9093 -t codfw.mediawiki.cirrussearch.page_weighted_tags_change.rc0 -o -1 -e -X security.protocol=ssl -X ssl.ca.location=/etc/ssl/certs/wmf-ca-certificates.crt | jq 'select(. | tostring | test("classification.prediction.articlecountry"))'

{
  "$schema": "/development/cirrussearch/page_weighted_tags_change/1.0.0",
  "dt": "2025-02-06T06:43:46Z",
  "meta": {
    "stream": "mediawiki.cirrussearch.page_weighted_tags_change.rc0",
    "id": "918ae21b-f737-4940-bb31-5720b380fb75",
    "request_id": "f20fc27f-1f99-4ea6-b0ff-d8e31236dd60",
    "domain": "simple.wikipedia.org",
    "uri": "https://simple.wikipedia.org/wiki/Federal_Parliament_of_Nepal",
    "dt": "2025-02-06T06:43:48.446Z"
  },
  "page": {
    "namespace_id": 0,
    "page_id": 949310,
    "page_title": "Federal_Parliament_of_Nepal"
  },
  "weighted_tags": {
    "set": {
      "classification.prediction.articlecountry": [
        {
          "tag": "Nepal",
          "score": 1
        }
      ]
    }
  },
  "wiki_id": "simplewiki",
  "rev_based": true
}
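To make the relationship between the two event shapes above easier to follow, here is a minimal Python sketch of how a prediction event could be mapped to a weighted-tags event. It is not the model-server's actual code: the helper name to_weighted_tags_event, the constants, and the choice to copy dt, domain, and uri from the source event are assumptions for illustration. Only the field names are taken from the sample events, and the sample input is a trimmed copy of the frwiki event. Tag clearing is not covered.

```python
from typing import Any, Optional

# Values copied from the sample events above.
TAG_PREFIX = "classification.prediction.articlecountry"
TARGET_STREAM = "mediawiki.cirrussearch.page_weighted_tags_change.rc0"
TARGET_SCHEMA = "/development/cirrussearch/page_weighted_tags_change/1.0.0"


def to_weighted_tags_event(prediction_event: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Hypothetical helper: derive a weighted-tags event from a prediction event.

    Returns None when there are no predicted countries; how the real
    model-server handles that case (e.g. clearing tags) is not shown here.
    """
    classification = prediction_event["predicted_classification"]
    probabilities = classification["probabilities"]
    # One {tag, score} entry per predicted country.
    tags = [
        {"tag": country, "score": probabilities[country]}
        for country in classification["predictions"]
    ]
    if not tags:
        return None

    page = prediction_event["page"]
    return {
        "$schema": TARGET_SCHEMA,
        "dt": prediction_event["dt"],  # assumption: reuse the source event time
        "meta": {
            "stream": TARGET_STREAM,
            "domain": prediction_event["meta"]["domain"],
            "uri": prediction_event["meta"]["uri"],
        },
        "page": {
            "namespace_id": page["namespace_id"],
            "page_id": page["page_id"],
            "page_title": page["page_title"],
        },
        "weighted_tags": {"set": {TAG_PREFIX: tags}},
        "wiki_id": prediction_event["wiki_id"],
        "rev_based": True,
    }


# Trimmed copy of the frwiki prediction event shown above.
SAMPLE_EVENT = {
    "dt": "2025-02-06T06:46:48Z",
    "wiki_id": "frwiki",
    "page": {
        "page_id": 532514,
        "page_title": "Saint-Léger-aux-Bois_(Oise)",
        "namespace_id": 0,
        "is_redirect": False,
    },
    "meta": {
        "domain": "fr.wikipedia.org",
        "uri": "https://fr.wikipedia.org/wiki/Saint-L%C3%A9ger-aux-Bois_(Oise)",
    },
    "predicted_classification": {
        "model_name": "article-country",
        "model_version": "1",
        "predictions": ["France"],
        "probabilities": {"France": 1},
    },
}

if __name__ == "__main__":
    result = to_weighted_tags_event(SAMPLE_EVENT)
    print(result["weighted_tags"])
    # {'set': {'classification.prediction.articlecountry': [{'tag': 'France', 'score': 1}]}}
```

The sketch omits meta.id and meta.request_id, which appear in the real events; where those values come from is not stated in this task, so they are left out rather than guessed.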

As David shared above, the setting and clearing of the classification.prediction.articlecountry tag can also be seen in this Grafana dashboard:
https://grafana-rw.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=classification_prediction_articlecountry&var-search_cluster_site=eqiad&var-search_cluster=consumer-search

@Isaac @kevinbazira @SuchetaG

Is there any reason we shouldn't expose this stream and the outlink topic model stream publicly at https://stream.wikimedia.org?

If we did T326179: Emit revision revert risk scores as a stream and expose in EventStreams API, we could make that public too!

I can make a task for this if you like. :)

Change #1130627 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[integration/config@master] inference-services: trigger article-country CI on python dir change

https://gerrit.wikimedia.org/r/1130627

Change #1130627 merged by jenkins-bot:

[integration/config@master] inference-services: trigger article-country CI on python dir change

https://gerrit.wikimedia.org/r/1130627