
Update the article-country isvc to use Wikilinks for predictions
Closed, ResolvedPublic

Description

In the initial prototype of the article-country inference service, developed by the Research team, predictions were generated using 3 components (Wikidata Properties, Categories, and Wikilinks) to determine the country(ies) associated with a Wikipedia page. This prototype used a ~715MB SQLite database to manage Wikilink-related predictions, leading to a static and sizable dependency.

In T371897, the ML team productionized the article-country isvc and deployed it on LiftWing. This production version relies on 2 components (Wikidata Properties and Categories) to make predictions. It integrates predictions into the Wikipedia Search index through the mediawiki.cirrussearch.page_weighted_tags_change.rc0 event stream, as shown in T382295.

In order for the production version of this service to use Wikilinks for predictions, we are going to use the classification.prediction.articlecountry weighted tags from the Wikipedia Search index instead of the static SQLite database dependency. This approach will enable the inference service to fetch up-to-date Wikilink predictions.

UPDATE
Following T385970#10548654, a meeting with the Search, Research, and ML teams (meeting notes) determined that relying on Wikilink predictions from the Wikipedia Search index is not viable because the cirrusdoc API is unstable.

Later on, the Research and ML teams evaluated the approaches shown below:

1. Disable the Wikilinks feature temporarily: this would result in a loss of approximately 20–30% of predictions (varying by wiki). While it avoids the instability issues, it sacrifices a significant portion of prediction coverage.
2. Revert to the static SQLite database (~715MB): this would restore most of the previous coverage despite being static. However, it introduces a larger dependency on LiftWing and poses challenges for timely updates, as the current pipelines for this database are not optimal.
3. Conditionally use the Search API: this would involve using the Search API only when Wikidata/Categories predictions are unavailable, possibly via an initial GET request to verify the existence of a value. However, this still results in static Wikilink predictions and might lead to inconsistent or unexplained results. There was also skepticism about achieving an implementation acceptable to the Search Platform.

After weighing these options, Option 2 was considered the most feasible. This approach aligns with practices already used by another model-server (reference-risk) that relies on a database dependency. The Research team has provided the SQLite database in P73436#294761, and we will integrate it into the article-country model-server to support Wikilink-related predictions.

Event Timeline

Thanks for kicking this off! Quick thoughts:

  • Here's the example API call I suggested (just added a redirects parameter too) that I think would get the data we need for each article. Should be pretty easy to get all the data with some logic equivalent to what we did for outlink topics, which was the same idea of gathering metadata for all of the outlinks in an article though there it was QIDs instead of weighted tags. In that example, the article for Kyoto has "classification.prediction.articlecountry/Japan|1000" in its result-set (though most pages obviously don't have data yet).
  • For the original implementation that used the SQLite database, we treated every country as a count of 1 (code). The weighted tags have confidence scores between 0 and 1000 so we'd just need to divide the score by 1000 and add that 0-1 value instead of the uniform 1. Links without any countries associated with them should still be represented as an empty string getting a 1 weight (code).
  • I put a limit of 500 links processed on the prototype API. Obviously most articles have far fewer than 500 links but doing a single API call for 100 links and their tags does seem like it can take a second or so. I don't think we have strong latency requirements for this model (it's not trying to support super low-latency use-cases at the moment like RecentChanges or Enterprise) so I figure that a few seconds for link-heavy articles is okay. In an ideal scenario, you're processing all of an article's links because the API call seems to retrieve the links in alphabetical order which means they aren't a truly random sample of links. But at the same time I'm not sure that fetching+processing all 2344 pagelinks from the Taylor Swift article (for example) is ideal either from a latency perspective and I'm hopeful that the first few hundred are a reasonable representation of the rest.
  • Let me know if other questions/issues come up and happy to try to think through them. Thanks!
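The score conversion described in the bullets above can be sketched in Python (the tag format is taken from the Kyoto example; the helper name is hypothetical, not code from the prototype):

```python
def weighted_tag_to_score(tag_value: str) -> tuple[str, float]:
    """Parse a weighted tag like
    'classification.prediction.articlecountry/Japan|1000'
    into (country, weight): confidence is 0-1000, so divide by 1000."""
    prefix = "classification.prediction.articlecountry/"
    body = tag_value[len(prefix):] if tag_value.startswith(prefix) else tag_value
    country, _, raw = body.rpartition("|")
    return country, int(raw) / 1000


# Accumulate link weights per country; links with no associated country
# come through as an empty-string country, accumulated the same way.
counts: dict[str, float] = {}
for tag in ("classification.prediction.articlecountry/Japan|1000",
            "classification.prediction.articlecountry/Japan|500"):
    country, weight = weighted_tag_to_score(tag)
    counts[country] = counts.get(country, 0.0) + weight
```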

Looping in @dcausse as well for context/guidance: as you can see, the plan for the article-country model is to add this additional component which uses the countries associated with all the links in a Wikipedia article to make additional country predictions -- e.g., if an article links to a bunch of articles that are about Japan, then it is also probably about Japan. The Search index will already be storing all of this information in weighted tags and has convenient access via a pagelinks generation + cirrus_doc API, so the plan is to use that to gather the necessary data. Right now, most articles in the Search index are missing country predictions though, because they are only slowly being added as articles are edited. I prepared a csv file (1.2 GB, 43,373,607 lines) at stat1008:/home/isaacj/topic_model/wiki-region-groundtruth/regions-cirrus-upload.csv that's based on data up to the end of 2024 with one line per country prediction (0 to multiple lines per article depending on number of predicted countries). It's easy to reformat so don't hesitate to ask.

What's your guidance on doing a bulk upload to fill up the index?

wiki,pageID,country,weight
zhwiki,643793,Japan,1.0
jawiki,82776,Japan,1.0
kowiki,3054810,Japan,1.0
enwiki,1438941,Japan,1.0
huwiki,43274,Slovakia,1.0
itwiki,3479105,Slovakia,1.0
srwiki,749821,Slovakia,1.0
ptwiki,6221382,Slovakia,1.0
rowiki,2411943,Slovakia,1.0
...

For context, I also did a quick little analysis of most-recent edit dates for English Wikipedia articles:

SELECT
  LEFT(rev_timestamp, 6) AS month,
  COUNT(1) AS num_pages
FROM page p
INNER JOIN revision r
  ON (p.page_latest = r.rev_id)
WHERE
  p.page_namespace = 0
  AND p.page_is_redirect = 0
GROUP BY
  LEFT(rev_timestamp, 6)
ORDER BY
  month ASC

Results (by year):

Year	pages	percent
2008	637	0.01%
2009	1331	0.02%
2010	1609	0.02%
2011	1997	0.03%
2012	3270	0.05%
2013	8315	0.12%
2014	4405	0.06%
2015	6011	0.09%
2016	9878	0.14%
2017	17826	0.26%
2018	23378	0.34%
2019	64525	0.93%
2020	77168	1.11%
2021	216359	3.11%
2022	326822	4.70%
2023	839045	12.07%
2024	3769010	54.21%
2025	1581048	22.74%
Total:	6952634

As you can see, about 25% of articles on English Wikipedia were last edited in 2023 or earlier. That is a pretty substantial long tail of content that won't get updated in the Search index if we rely on the edit stream without doing an initial upload of data. This has implications for the quality of the link-based predictions that this task is focused on, but it also means that downstream uses of the articlecountry tag will have pretty low coverage for a while. These low-edit articles are also often a great use case for the recommender systems that will be using this country filter, because they are content that benefits from being surfaced to editors for updates.

As an aside, I was curious about the articles that haven't been updated since 2008, but they're mostly disambiguation pages -- e.g., https://en.wikipedia.org/wiki/Neil_MacFarlane. Those aren't easy to filter out in my quick query, but Wikidata suggests that about 5% of English Wikipedia articles might be disambiguation pages, so a conservative adjustment of the above would be ~20% of articles last edited in 2023 or earlier.

SELECT (COUNT(?article) AS ?count) WHERE {
  ?item wdt:P31 wd:Q4167410.  # Instance of 'disambiguation page'
  ?article schema:about ?item .
  ?article schema:isPartOf <https://en.wikipedia.org/>.
}
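As a quick sanity check, those figures can be recomputed from the yearly table above (numbers copied from the table; the ~5% disambiguation adjustment is applied as a simple subtraction):

```python
# Pages by year of most-recent edit, from the English Wikipedia table above.
pages_by_year = {
    2008: 637, 2009: 1331, 2010: 1609, 2011: 1997, 2012: 3270, 2013: 8315,
    2014: 4405, 2015: 6011, 2016: 9878, 2017: 17826, 2018: 23378, 2019: 64525,
    2020: 77168, 2021: 216359, 2022: 326822, 2023: 839045, 2024: 3769010,
    2025: 1581048,
}
total = sum(pages_by_year.values())                         # 6,952,634
stale = sum(n for y, n in pages_by_year.items() if y <= 2023)
share = stale / total                                       # ~0.23, roughly a quarter
adjusted = share - 0.05                                     # minus ~5% disambiguation pages
```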

@Isaac we have some utilities to push weighted tags into the search index from a spark job, but 43M rows is a lot and we should be careful not to slow down real-time indexing while doing so. I'll prepare something and update the weighted tags documentation so that we can easily re-use it for future work.
Unfortunately our system requires a bit more info than what you have in your CSV:

  • namespace_id
  • page_title
  • page_id

I suppose we could easily join your csv with some other tables available in the datalake to get them; is this something you would be able to prepare while I work on a small code snippet to push these predictions to the search index?

We probably want to group by page_id in order to send a single update in the case where multiple countries were predicted for the same page, if we work with spark this should be trivial.
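In plain Python, that grouping step looks like the following (illustrative rows based on the CSV sample above, including a hypothetical second prediction for one page; the production job would do the equivalent groupBy in Spark):

```python
import csv
import io
from itertools import groupby

# Rows for the same page must be contiguous for groupby to work.
reader = csv.DictReader(io.StringIO(
    "wiki,pageID,country,weight\n"
    "zhwiki,643793,Japan,1.0\n"
    "huwiki,43274,Slovakia,1.0\n"
    "huwiki,43274,Hungary,1.0\n"  # hypothetical second prediction for the page
))

# One entry per (wiki, pageID), countries joined so a single update
# carries all predictions for that page.
grouped = {
    key: ";".join(row["country"] for row in rows)
    for key, rows in groupby(reader, key=lambda r: (r["wiki"], r["pageID"]))
}
```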

@dcausse yep I can add those. thanks! New file: stat1008:/home/isaacj/topic_model/wiki-region-groundtruth/regions-cirrus-upload.tsv.gz

Notes:

  • Added the additional requested fields (just pulling from page table in our datalake so easy), ordered by wiki + page_id in case that helps, and also grouped countries for a page on a single line with a semicolon separator. So down to 39.6M rows from 43.3M. I also gzipped it just to make it slightly quicker to move around. Let me know if any other formatting/ordering changes would be helpful -- it's easy to adjust and re-run.
  • Given that we're dealing with data as of 01-01-2025, I assume that some titles/pageIDs in the dump won't align with the current Search Index but that whatever data we lose there is acceptable.
$ ls -lht regions-cirrus-upload.tsv.gz
-rw-r--r-- 1 isaacj wikidev 504M Feb 11 17:58 regions-cirrus-upload.tsv.gz

$ zless regions-cirrus-upload.tsv.gz | wc -l
39645675

$ zless regions-cirrus-upload.tsv.gz | head
wiki_db	page_namespace	page_id	page_title	countries
abwiki	0	807	Аԥсуа_бызшәа	Russia;Iraq;Turkey;Georgia;Jordan;Syria
abwiki	0	1040	Аҟәа	Georgia
abwiki	0	1046	Гагра	Georgia
abwiki	0	1053	Аԥсны_Аҳәынҭқарра	Georgia
abwiki	0	1056	Гәдоуҭа	Georgia
abwiki	0	1058	Афон_Ҿыц	Georgia
abwiki	0	1059	Очамчыра	Georgia
abwiki	0	1062	Багаԥшь,_Сергеи_Уасил-иԥа	Georgia
abwiki	0	1635	Анқәаб,_Александр_Золотинска-иԥа	Russia;Georgia

@Isaac awesome, thanks! I'll get something ready this week and report back here.

Change #1119370 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: use weighted tags stream instead of SQLite db

https://gerrit.wikimedia.org/r/1119370

It looks like you are planning on using cirrusdoc directly, which isn't supposed to be a stable interface. I'm not sure I entirely understand what's going on there, we should probably discuss this more directly. I scheduled some time next week.

@Isaac the weight disappeared in your second tsv file; should I assume that all the predictions should be assigned a score of 1.0?
I have a code snippet ready to run:

import java.lang
import org.apache.http.client.methods.HttpGet
import org.apache.http.util.EntityUtils
import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{col, current_timestamp, date_format, lit, struct}
import org.apache.spark.sql.types.LongType
import org.codehaus.jackson.map.ObjectMapper
import org.wikimedia.eventutilities.core.http.BasicHttpClient

val wiki = "__WIKI__"

def get_domain(w: String): String = {
  val client = (BasicHttpClient.builder()
    .addRoute("https://noc.wikimedia.org", "https://noc.wikimedia.org:443:mw-misc.discovery.wmnet:30443")
    .httpClientBuilder()
    .setUserAgent("wmf/import_articlecountry_T385970")
    .build())
  val objectMapper: ObjectMapper = new ObjectMapper()
  val resp = client.execute(new HttpGet(f"https://noc.wikimedia.org/wiki.php?wiki=$wiki&format=json"))
  if (resp.getStatusLine.getStatusCode != 200) {
    throw new IllegalArgumentException(s"failed to get domain name for $wiki ${resp.getStatusLine}")
  }

  val node = objectMapper.readTree(EntityUtils.toString(resp.getEntity))
  val domainName = node.get("wgCanonicalServer").asText().replaceAll("^https://", "")
  resp.close()
  domainName
}
val wikiDomainMap = Map(wiki -> get_domain(wiki))
println(wikiDomainMap)

case class WeightedTag(tag: String, score: Double)
def to_weighted_tags_fn() = functions.udf((predictions: String) => {
  val pred_array = predictions.split(";").map(c => c.split(":") match {
    case Array(tag, score) => WeightedTag(tag = tag, score = lang.Double.parseDouble(score))
  })
  Map("classification.prediction.articlecountry" -> Array(pred_array:_*))
})
val to_weighted_tags = to_weighted_tags_fn()

val uuid: String = java.util.UUID.randomUUID().toString
println(f"Using request_id $uuid")

(spark.read
  .option("header", value = true)
  .option("delimiter", "\t")
  .csv("hdfs://analytics-hadoop/user/dcausse/topic_model/wiki-region-groundtruth/regions-cirrus-upload.tsv.gz")
  .filter(col("wiki_db").equalTo(lit(wiki)))
  .select(
    struct(
      lit(uuid).as("request_id"),
      lit(wikiDomainMap(wiki)).as("domain")
    ).as("meta"),
    date_format(current_timestamp(), "yyyy-MM-dd'T'HH:mm:ss'Z'").as("dt"),
    col("wiki_db").as("wiki_id"),
    lit(false).as("rev_based"),
    struct(
      col("page_id").cast(LongType).as("page_id"),
      col("page_namespace").cast(LongType).as("namespace_id"),
      col("page_title").as("page_title")
    ).as("page"),
    struct(
      to_weighted_tags(col("countries")).as("set")
    ).as("weighted_tags")
  )
  .repartition(1)
  .write
  .format("wmf-event-stream")
  .option("event-stream-name", "mediawiki.cirrussearch.page_weighted_tags_change.rc0")
  .option("event-schema-base-uris", "https://schema.wikimedia.org/repositories/primary/jsonschema")
  .option("event-stream-config-uri", "https://meta.wikimedia.org/w/api.php?action=streamconfigs")
  .option("event-schema-version", "1.0.0")
  .option("event-stream-topic-prefix", "eqiad")
  .option("kafka.bootstrap.servers", "CHANGE_ME")
  .option("rate-limit", 60)
  // this is the default topic used by the kafka sink if a row does not specify one
  .option("topic", "eqiad.mediawiki.cirrussearch.page_weighted_tags_change.rc0")
  .save())

System.exit(0)

I might start the back-fill on Monday.

@dcausse whoops, apologies, that got lost in the format change but it's now fixed! Many of them are 1.0 but not all. I overwrote the same file so you'll just want to copy it over again. When you do the predictions.split, you'll then just want to split each c further by a : to get the country and score (no country names contain a colon). Thanks, and just let me know if you run into any issues on the back-fill.

$ ls -lht regions-cirrus-upload.tsv.gz
-rw-r--r-- 1 isaacj wikidev 512M Feb 14 18:53 regions-cirrus-upload.tsv.gz

$ zless regions-cirrus-upload.tsv.gz | wc -l
39645675

$ zless regions-cirrus-upload.tsv.gz | head
wiki_db	page_namespace	page_id	page_title	countries
abwiki	0	807	Аԥсуа_бызшәа	Russia:1.0;Iraq:1.0;Turkey:1.0;Georgia:1.0;Jordan:1.0;Syria:1.0
abwiki	0	1040	Аҟәа	Georgia:1.0
abwiki	0	1046	Гагра	Georgia:1.0
abwiki	0	1053	Аԥсны_Аҳәынҭқарра	Georgia:1.0
abwiki	0	1056	Гәдоуҭа	Georgia:1.0
abwiki	0	1058	Афон_Ҿыц	Georgia:1.0
abwiki	0	1059	Очамчыра	Georgia:1.0
abwiki	0	1062	Багаԥшь,_Сергеи_Уасил-иԥа	Georgia:1.0
abwiki	0	1635	Анқәаб,_Александр_Золотинска-иԥа	Russia:1.0;Georgia:1.0
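Parsing that countries field is straightforward (a sketch; the helper name is illustrative):

```python
def parse_countries(field: str) -> dict[str, float]:
    """Split a 'Country:score;Country:score' field as in the TSV above.
    Country names contain no colons, so splitting each entry on the
    last ':' is safe."""
    pairs = (entry.rsplit(":", 1) for entry in field.split(";"))
    return {country: float(score) for country, score in pairs}
```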

@Isaac thanks! I started the backfill at 60 articles/sec (on a per-wiki basis, from smallest to biggest). At this rate it should take ~7 days.

@dcausse much appreciated! Already seeing the results above reflected in abwiki -- e.g., pageID 807, which hasn't been edited since 10 January 2025 (before the backfill was started), now has all six countries shown in my comment above: https://ab.wikipedia.org/w/api.php?action=query&prop=cirrusdoc&pageids=807&cdincludes=weighted_tags

Unfortunately I had to pause the backfill; we've burnt through quite a bit of our update-lag SLO budget (https://grafana-rw.wikimedia.org/d/8xDerelVz/search-update-lag-slo?orgId=1).
I'm going to re-enable it at a slower pace, but this means it probably won't finish by the end of the week...

Thanks for the update -- given that work is still in progress on the keyword and we've paused the use of cirrusdoc for link-based predictions, I think a slower pace won't hurt anything.

Change #1124443 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] Makefile: download SQLite db used by article-country

https://gerrit.wikimedia.org/r/1124443

Change #1124443 merged by Kevin Bazira:

[machinelearning/liftwing/inference-services@main] Makefile: download SQLite db used by article-country

https://gerrit.wikimedia.org/r/1124443

Change #1125126 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: update score normalization to support wikilink-related predictions

https://gerrit.wikimedia.org/r/1125126

Change #1125126 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: update score normalization to support wikilink-related predictions

https://gerrit.wikimedia.org/r/1125126

Change #1125398 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-country image

https://gerrit.wikimedia.org/r/1125398

Change #1125398 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-country image

https://gerrit.wikimedia.org/r/1125398

Change #1125661 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: add support for wikilink-related predictions

https://gerrit.wikimedia.org/r/1125661

Change #1125661 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: add support for wikilink-related predictions

https://gerrit.wikimedia.org/r/1125661

Change #1119370 abandoned by Kevin Bazira:

[machinelearning/liftwing/inference-services@main] article-country: use weighted tags stream instead of SQLite db

Reason:

following T385970#10548654, we ended up using an SQLite db as shown in: https://gerrit.wikimedia.org/r/1125661

https://gerrit.wikimedia.org/r/1119370

Change #1126963 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-country image

https://gerrit.wikimedia.org/r/1126963

Change #1126963 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-country image

https://gerrit.wikimedia.org/r/1126963

Change #1127410 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: fix image tags for article-country and articlequality in staging

https://gerrit.wikimedia.org/r/1127410

Change #1127410 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: fix image tags for article-country and articlequality in staging

https://gerrit.wikimedia.org/r/1127410

@Isaac, the model-server we worked on in P73436 has been deployed in LiftWing staging. Please test the internal endpoint shown below and let us know if you come across any issues before we proceed to production.

$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-country:predict" -X POST -d '{"lang": "en", "title": "Toni_Morrison"}' -H  "Host: article-country.article-models.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
    "model_name":"article-country",
    "model_version":"1",
    "prediction":{
        "article":"https://en.wikipedia.org/wiki/Toni_Morrison",
        "wikidata_item":"Q72334",
        "results":[
            {
                "country":"United States",
                "score":1.0,
                "source":{
                    "wikidata_properties":[{"P27":"country of citizenship"}],
                    "categories":["Category:21st-century American women writers"],
                    "links":[{"country":"United States","count":246.0,"prop-tfidf":0.6274452843124155}]
                }
            }
        ]
    }
}


$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-country:predict" -X POST -d '{"lang": "en", "title": "Elizabeth,_Lady_Thurles"}' -H  "Host: article-country.article-models.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
    "model_name":"article-country",
    "model_version":"1",
    "prediction":{
        "article":"https://en.wikipedia.org/wiki/Elizabeth,_Lady_Thurles",
        "wikidata_item":"Q5362224",
        "results":[
            {
                "country":"United Kingdom",
                "score":0.41009373337153615,
                "source":{
                    "wikidata_properties":[],
                    "categories":[],
                    "links":[{"country":"United Kingdom","count":21.5,"prop-tfidf":0.41009373337153615}]
                }
            },
            {
                "country":"Ireland",
                "score":0.2905325201588147,
                "source":{
                    "wikidata_properties":[],
                    "categories":[],
                    "links":[{"country":"Ireland","count":9.0,"prop-tfidf":0.2905325201588147}]
                }
            }
        ]
    }
}

One small thing that I've noticed -- empty-string countries can show up in the response, which we want to filter out. Should be a simple fix (my code). Example:

$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-country:predict" -X POST -d '{"lang": "en", "title": "Battle of Focșani"}' -H  "Host: article-country.article-models.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
  "model_name": "article-country",
  "model_version": "1",
  "prediction": {
    "article": "https://en.wikipedia.org/wiki/Battle of Foc\u0219ani",
    "wikidata_item": "Q1025134",
    "results": [
      {
        "country": "Romania",
        "score": 1.0,
        "source": {
          "wikidata_properties": [
            {
              "P625": "coordinate location"
            }
          ],
          "categories": [
            "Category:Military history of Romania"
          ]
        }
      },
      {
        "country": "Austria",
        "score": 0.5,
        "source": {
          "wikidata_properties": [],
          "categories": [
            "Category:Battles involving Austria"
          ]
        }
      },
      {
        "country": "Hungary",
        "score": 0.5,
        "source": {
          "wikidata_properties": [],
          "categories": [
            "Category:Battles involving Hungary"
          ]
        }
      },
      {
        "country": "",
        "score": 0.16788642542434984,
        "source": {
          "wikidata_properties": [],
          "categories": [],
          "links": [
            {
              "country": "",
              "count": 256,
              "prop-tfidf": 0.3357728508486997
            }
          ]
        }
      }
    ]
  }
}

Change #1127793 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: filter out empty country entries in wikilink predictions

https://gerrit.wikimedia.org/r/1127793

Change #1127793 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: filter out empty country entries in wikilink predictions

https://gerrit.wikimedia.org/r/1127793

Change #1127857 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-country image

https://gerrit.wikimedia.org/r/1127857

Change #1127857 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-country image

https://gerrit.wikimedia.org/r/1127857

One small thing that I've noticed -- empty string countries can show up in the response which we want to filter out.

Nice catch @Isaac! We have fixed this issue as shown below:

$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-country:predict" -X POST -d '{"lang": "en", "title": "Battle of Focșani"}' -H  "Host: article-country.article-models.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
  "model_name": "article-country",
  "model_version": "1",
  "prediction": {
    "article": "https://en.wikipedia.org/wiki/Battle of Foc\u0219ani",
    "wikidata_item": "Q1025134",
    "results": [
      {
        "country": "Romania",
        "score": 1.0,
        "source": {
          "wikidata_properties": [
            {
              "P625": "coordinate location"
            }
          ],
          "categories": [
            "Category:Military history of Romania"
          ]
        }
      },
      {
        "country": "Austria",
        "score": 0.5,
        "source": {
          "wikidata_properties": [],
          "categories": [
            "Category:Battles involving Austria"
          ]
        }
      },
      {
        "country": "Hungary",
        "score": 0.5,
        "source": {
          "wikidata_properties": [],
          "categories": [
            "Category:Battles involving Hungary"
          ]
        }
      },
      {
        "country": "Greece",
        "score": 0.13117945627737737,
        "source": {
          "wikidata_properties": [],
          "categories": [],
          "links": [
            {
              "country": "Greece",
              "count": 25.0,
              "prop-tfidf": 0.26235891255475474
            }
          ]
        }
      }
    ]
  }
}

Please confirm whether we should proceed to prod. Thanks!

Hey -- as we discussed separately, let's update the code to still include the tfidf value for the empty country in the tfidf_sum for normalization purposes. This helps avoid overestimating the relevance of countries when most links are to non-country-related articles. For the example of en:Nutmeg, I saw the score for China drop from 0.32 to 0.21 (i.e. below threshold) when the empty country was included. I think this also better matches my expectations for the model. It should just be a matter of moving where the skipping of empty countries happens, so that they're included for tfidf_sum += prop_tfidf but not for computed_tfidf[country] = prop_tfidf. The validation that this works would be:

isaacj@stat1008:~$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-country:predict" -X POST -d '{"lang": "en", "title": "Nutmeg"}' -H  "Host: article-country.article-models.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
  "model_name": "article-country",
  "model_version": "1",
  "prediction": {
    "article": "https://en.wikipedia.org/wiki/Nutmeg",
    "wikidata_item": "Q83165",
    "results": [ 
      {
        "country": "China",  # this result should disappear because prop-tfidf should drop to below the 0.25 threshold
        "score": 0.32170302901927567,
        "source": {
          "wikidata_properties": [],
          "categories": [],
          "links": [
            {
              "country": "China",
              "count": 43.0,
              "prop-tfidf": 0.32170302901927567
            }
          ]
        }
      }
    ]
  }
}
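The requested normalization change can be sketched roughly as follows (a sketch under stated assumptions: raw per-country tf-idf values, the 0.25 threshold mentioned above, and the empty string marking links with no associated country; the function name and shapes are illustrative, not the service's actual code):

```python
def link_predictions(raw_tfidf: dict[str, float],
                     threshold: float = 0.25) -> dict[str, float]:
    # The empty country contributes to the normalizing sum...
    tfidf_sum = sum(raw_tfidf.values())
    computed_tfidf = {
        country: value / tfidf_sum
        for country, value in raw_tfidf.items()
        if country  # ...but is never emitted as a prediction
    }
    # Only countries above the threshold survive.
    return {c: v for c, v in computed_tfidf.items() if v >= threshold}


# When most links map to no country, a borderline country drops below threshold:
link_predictions({"China": 1.0, "": 1.0})  # China normalizes to 0.5 -> kept
link_predictions({"China": 1.0, "": 4.0})  # China normalizes to 0.2 -> dropped
```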

Change #1128323 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[machinelearning/liftwing/inference-services@main] article-country: include empty country links in tfidf_sum for normalization

https://gerrit.wikimedia.org/r/1128323

Change #1128323 merged by jenkins-bot:

[machinelearning/liftwing/inference-services@main] article-country: include empty country links in tfidf_sum for normalization

https://gerrit.wikimedia.org/r/1128323

Change #1128378 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update article-country image

https://gerrit.wikimedia.org/r/1128378

Change #1128378 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update article-country image

https://gerrit.wikimedia.org/r/1128378

as we discussed separately, let's update the code to still include the tfidf value for the empty country in the tfidf_sum for normalization purposes.

We have updated the model-server as per your suggestion, and as noted in your validation, the en:Nutmeg request no longer returns any wikilink-related prediction results:

$ curl "https://inference-staging.svc.codfw.wmnet:30443/v1/models/article-country:predict" -X POST -d '{"lang": "en", "title": "Nutmeg"}' -H  "Host: article-country.article-models.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
  "model_name": "article-country",
  "model_version": "1",
  "prediction": {
    "article": "https://en.wikipedia.org/wiki/Nutmeg",
    "wikidata_item": "Q83165",
    "results": []
  }
}

Looks great -- thanks!!

super! article-country wikilink-related predictions are now live in LiftWing production:

# pod running in eqiad
$ kube_env article-models ml-serve-eqiad
$ kubectl get pods
NAME                                                          READY   STATUS    RESTARTS   AGE
article-country-predictor-00008-deployment-5b6d9785c-9xfjh   3/3     Running   0          4m12s


# pod running in codfw
$ kube_env article-models ml-serve-codfw
$ kubectl get pod
NAME                                                          READY   STATUS    RESTARTS   AGE
article-country-predictor-00008-deployment-54fb689b88-6t22h   3/3     Running   0          2m18s


# isvc run successfully
$ curl "https://inference.svc.codfw.wmnet:30443/v1/models/article-country:predict" -X POST -d '{"lang": "en", "title": "Toni_Morrison"}' -H  "Host: article-country.article-models.wikimedia.org" -H "Content-Type: application/json" --http1.1

{
    "model_name":"article-country",
    "model_version":"1",
    "prediction":{
        "article":"https://en.wikipedia.org/wiki/Toni_Morrison",
        "wikidata_item":"Q72334",
        "results":[
            {
                "country":"United States",
                "score":1.0,
                "source":{
                    "wikidata_properties":[{"P27":"country of citizenship"}],
                    "categories":["Category:21st-century American women writers"],
                    "links":[{"country":"United States","count":246.0,"prop-tfidf":0.6274452843124155}]
                }
            }
        ]
    }
}