
Determine evaluation strategy for article-country model
Closed, Resolved · Public

Description

Decide strategy for evaluating the article-country model. It's not a straightforward evaluation for a few reasons:

  • Some of the countries are pulled directly from Wikidata (by definition we're essentially treating these as 100% precision, though we could verify that at some point)
  • The countries that are "missing" from Wikidata are not missing at random -- i.e., certain topics have very high country coverage on Wikidata (such as people and country-of-citizenship) while others have very low coverage (such as the countries where a given species is endemic). The intent is for the model to use the links in the article to discover these missing countries, but if we test it against the existing groundtruth, it'll tell us how well it works for articles where we don't need it, not how well it works for articles where we do.
  • There is also the language component to consider -- this might be more or less effective in different languages based on their article lengths / linking norms.

In reality, I'll probably do a few things:

  • Test the linking approach on the available groundtruth, just to get a sense of its effectiveness and because it can be done easily.
  • Split articles into 1) those with countries from Wikidata, 2) those without Wikidata-based countries but with link-based predictions, and 3) those without Wikidata-based countries and no link-based predictions. Verifying the results for groups 1 and 3 should be pretty quick but is perhaps less important, so I can probably get by with a reasonable random sample. It's group 2 that will probably take the most time to validate, because there may well be a connection between the subject of the article and the country but it might not be direct or explicit. For these, I'll more carefully build a sample to have folks evaluate for me. (A minimal sketch of this three-way split is below.)
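To make the grouping concrete, here's a minimal sketch of the three-way split, assuming each article's per-source country lists are available (the field names are illustrative, not the actual pipeline schema):

```
# Minimal sketch of the three-way evaluation split described above.
# Field names ("wikidata_countries", "link_countries") are illustrative assumptions.

def evaluation_group(article):
    """Return 1, 2, or 3 per the grouping above."""
    has_wikidata = bool(article.get("wikidata_countries"))
    has_links = bool(article.get("link_countries"))
    if has_wikidata:
        return 1  # countries already on Wikidata (treated as ~100% precision)
    elif has_links:
        return 2  # no Wikidata countries, but link-based predictions exist
    else:
        return 3  # no Wikidata countries and no link-based predictions

articles = [
    {"title": "A", "wikidata_countries": ["France"], "link_countries": ["France"]},
    {"title": "B", "wikidata_countries": [], "link_countries": ["Kenya"]},
    {"title": "C", "wikidata_countries": [], "link_countries": []},
]
for a in articles:
    print(a["title"], evaluation_group(a))
```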

Event Timeline

Weekly updates:

  • Took feedback that AS gathered from WikiProject Biodiversity folks around how to better incorporate species data into the model. This identified a new Wikidata property to include (taxon range) and the existence of categories on Wikipedia that are modeled as being explicitly connected to a given country on Wikidata. These could be used in the model but also present an opportunity as a signal independent of the Wikidata properties and Wikipedia links, which could be used to evaluate the recall of those approaches. I wrote up some quick code to make sure it was simple to extract them and see what sort of coverage they have (a rough sketch of the extraction pattern is below). Details: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Countries/Species
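For illustration, a rough sketch of that extraction pattern, assuming we query the Wikidata SPARQL endpoint for category items (Q4167836) that carry a country (P17) statement -- the model's actual property set isn't restated here:

```
# Rough sketch: pull Wikipedia category items that carry an explicit country
# statement on Wikidata. Uses country (P17) on category items (Q4167836) purely
# to illustrate the pattern; the model's actual property set may differ.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?category ?categoryLabel ?country ?countryLabel WHERE {
  ?category wdt:P31 wd:Q4167836 ;   # instance of: Wikimedia category
            wdt:P17 ?country .      # country
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
"""

resp = requests.get(SPARQL_ENDPOINT, params={"query": QUERY, "format": "json"},
                    headers={"User-Agent": "article-country-eval-sketch"})
for row in resp.json()["results"]["bindings"]:
    print(row["categoryLabel"]["value"], "->", row["countryLabel"]["value"])
```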

Weekly updates:

  • Incorporated categories with explicit countries set per Wikidata into the prototype API.
  • Compared the overlap (notebook) between these three signals (Wikidata properties, article links, Wikipedia categories) for a small sample of 100 random articles on English Wikipedia to begin to get a sense of what's going on (a small sketch of the overlap tally follows this list):
    • Most articles (77%) have at least one associated country
    • The US dominates, at about 1/3 of the countries predicted
    • Good agreement between Wikidata and wikilinks (~50%) when a country is predicted
    • Category coverage is lowest but does provide an additional 10% of the countries
    • Wikilinks give the most lift by themselves (20%)
    • Wikidata only provides unique predictions 7% of the time
  • In summary: right now, all the signals seem valuable. There are a fair number of wikilink-based predictions, which are the least confident of the predictions, so those will warrant deeper evaluation by people at some point to verify they are precise. These conclusions could easily change as I expand my samples beyond English Wikipedia though.
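A minimal sketch of the kind of per-country overlap tally behind these numbers (the article structure shown is illustrative toy data, not the notebook's actual schema):

```
# Tally, for each predicted country, which combination of signals produced it.
# The article dicts here are illustrative toy data.
from collections import Counter

articles = [
    {"wikidata": {"US"}, "links": {"US", "UK"}, "categories": set()},
    {"wikidata": set(), "links": {"France"}, "categories": {"France"}},
    {"wikidata": set(), "links": set(), "categories": set()},
]

tally = Counter()
for a in articles:
    all_countries = a["wikidata"] | a["links"] | a["categories"]
    if not all_countries:
        tally["no prediction"] += 1
        continue
    for country in all_countries:
        sources = frozenset(s for s in ("wikidata", "links", "categories") if country in a[s])
        tally[sources] += 1  # e.g. frozenset({'wikidata', 'links'}) -> agreement between the two

for sources, n in tally.most_common():
    print(sources, n)
```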

[Attached image: Screenshot 2024-07-18 at 14-32-24 Notebook.png (574 KB) -- signal-overlap visualization from the notebook]

Weekly updates:

  • I prepared a data dump of all country predictions (+ topics) for all Wikipedia articles and made that public (data; README). I'll be able to use this to easily generate samples of articles to evaluate with people once the strategy is set.
  • This also allowed me to begin to identify subsets of the projects where country predictions are largely driven by wikilinks, as those are probably the most important places to verify (the wikilinks method has the highest likelihood of false positives). Biology (species) stood out as expected, but so did Music (albums/songs/charts) and a few other wiki-specific areas (e.g., Chinese biographies). There's a ton of data and different ways to slice it, so I haven't decided yet exactly how to visualize it, but I can now create visualizations like the one from last week for any wiki/topic combination (a rough sketch of how these subsets are surfaced is below).
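Roughly, the subsets are surfaced by grouping the dump by wiki and topic and computing the share of predictions supported only by wikilinks. A sketch follows; the file name and column names are assumptions about the dump layout, not its actual schema:

```
# Surface wiki/topic subsets whose country predictions are mostly wikilink-driven.
# File name and column names are assumptions about the dump layout.
import pandas as pd

df = pd.read_csv("country_predictions.tsv.gz", sep="\t")

# Flag predictions that are supported only by wikilinks
df["links_only"] = df["from_links"] & ~df["from_wikidata"] & ~df["from_categories"]

# Share of link-only predictions per wiki + topic, largest first
by_subset = (df.groupby(["wiki", "topic"])["links_only"]
               .agg(["mean", "size"])
               .rename(columns={"mean": "prop_links_only", "size": "n"})
               .sort_values("prop_links_only", ascending=False))
print(by_subset.head(20))
```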

Weekly updates:

  • I took a pause from evaluation to focus on setting up conversations with ML Platform about hosting the model so we get those started earlier. It was also a good point to pause and reflect and do some writing to make sure I felt good about where we're at (I am). To that end, I put together a draft model card and am now working on filling out the phab request for hosting the model on LiftWing.
  • I also uploaded the code for some of the data pipelines I built for generating the bulk predictions: https://github.com/geohci/wiki-region-groundtruth/tree/main/notebooks

Weekly updates:

  • Paused evaluation to focus on putting in requests for hosting model on LiftWing (T360455), Research Engineering support for the data dependencies (T371900), and Search platform support for incorporating the predictions as tags (T301671#10052411). These should hopefully help identify any red flags with the approach early in the process and make sure the productionization goes smoothly when we're ready to move to that stage.

Weekly updates:

  • I started with a random sample of 50 articles to evaluate myself, to get an early sense and see how difficult it was. Generally I found it to be pretty fast -- using Google Translate to read the article works for most language editions, and inspecting Wikidata when useful meant I could do most of them in about 30 seconds each.
  • Out of the 50, I agreed with the predicted labels for 44 (88%), which is a good sign (full data)! And if you treat each prediction as independent, the model is at 100% precision with imperfect recall, which is a much better situation than having incorrect predictions (a small sketch of this per-prediction scoring follows this list).
  • Out of the six disagreements:
    • Three were species-related, with endemic areas that are mentioned in the article not being predicted (low recall). This is a known issue but we've identified strategies for improving coverage. Additionally, species articles make up a substantial proportion of articles on Wikipedia due to bot generation and simple notability criteria, but they do not receive a proportionate level of attention, so I'm okay with lower recall in this topic area.
    • Two were a version of "how much time do you have to spend in a country for it to be relevant to your life?". One was a painter who moved from the UK to the US at age 8: I assigned the UK but the model only predicted the US. The other was a British diplomat to France: I assigned UK + France but the model only predicted the UK. Again, both are recall issues and borderline cases, so I'm okay with this.
    • The final one was a Canadian film by an Ethiopian filmmaker that is set in Ethiopia, so I assigned Canada + Ethiopia but the model only predicted Canada. More links to the Ethiopia component in the article, or Wikidata properties mentioning Ethiopia, probably would have led to alignment, but there's no obvious fix here and again it's a recall issue, which I'm comfortable with.
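For clarity, a small sketch of the per-prediction scoring described above, treating each (article, country) pair independently; the examples are toy data mirroring the disagreements, not the real sample:

```
# Micro-averaged precision/recall over (article, country) pairs.
# Toy data: the model misses countries but makes no incorrect predictions.

def precision_recall(predicted, actual):
    """predicted/actual: dicts mapping article -> set of countries."""
    tp = sum(len(predicted[a] & actual[a]) for a in predicted)
    fp = sum(len(predicted[a] - actual[a]) for a in predicted)
    fn = sum(len(actual[a] - predicted[a]) for a in predicted)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

predicted = {"painter": {"US"}, "diplomat": {"UK"}, "film": {"Canada"}}
actual    = {"painter": {"US", "UK"}, "diplomat": {"UK", "France"}, "film": {"Canada", "Ethiopia"}}
print(precision_recall(predicted, actual))  # -> (1.0, 0.5): perfect precision, imperfect recall
```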

Weekly updates:

  • Starting to put together a more formal set of articles for evaluation. As I've said before, this is challenging because of how broad the space is. If you think about the user experience of "quality" for this model, I assume most people will only ever be using it for a specific country (probably where they live) and a specific language (their "home" wiki). So their perception of quality will be for that very narrow slice of the model. If we were to provide coverage for all of these entrypoints, however, that would be ~300 languages x 250 countries, which is 75,000 prediction spaces where you'd want at least 10 (and ideally 50) outputs evaluated to know robustly whether the model will perform well for that user need. Obviously that is way too many model outputs for individuals to evaluate. Even just doing a separate sample for every country on a single wiki would be several thousand articles needing evaluation. So we have to be much more efficient about choosing areas to focus on -- areas that should hopefully leave us with a good sense of model performance while catching some of the edge cases where a particular wiki+country intersection might have issues, so we can do something about it.
  • My goal with this evaluation is two-fold: 1) get enough general data to convince Product stakeholders that the model is good enough to proceed, and, 2) identify any areas where the model is making mistakes that we can actually fix in some way. Eventually, we might have specific language communities who want to do some additional evaluation before enabling the model in some way, but that would be addressed through much more focused evaluations related to whatever their concerns are -- i.e. not the goal here.
  • My proposed approach (I'll talk with Miriam about this when she's back) is as follows:
    • Get someone else to look at my random sample of 50 articles (and maybe extend to 100 to be a bit more robust). This will be the top-level metric that we share that says overall how well the model performs.
    • Do a more focused evaluation of a few countries where we're seeing high rates of predictions coming from the wikilinks alone (this is the most uncertain signal in the model and so where false positives are most likely to occur). I put some data below on the proportion of predictions for a given country (across all language editions) that came only from wikilinks. I'll probably draw a small sample (~10) from each country that's above e.g. 20% (0.2) of predictions based on wikilinks (so everything down to Papua New Guinea) and check those to see if any clear issues are identified (a rough sampling sketch follows the country list below). I will note that, given the way topic keywords work on Search, we could easily set it up so that country predictions coming from categories/Wikidata are weighted more heavily and so show up higher in the search results. This would reduce the likelihood of false positives being shown to users if the wikilinks are generating any.
    • Do a more focused evaluation of English Wikipedia as well. This is because Content Translation is a likely user of these model outputs and many translations start from English Wikipedia as the source, so the country filter is much more likely to be applied in that setting. This will likely be a simple random sample too. When I did the proportion-from-wikilinks for just English Wikipedia to see if any outliers stood out, it looked pretty similar to the ranking below (not surprising given the size of English Wikipedia).
Proportion of predictions coming from wikilinks only:
* Vatican City (n=44431): 0.723
* United States Minor Outlying Islands (n=1754): 0.348
* China (n=908175): 0.283
* Taiwan (n=144142): 0.282
* Thailand (n=180533): 0.273
* United Kingdom (n=2548277): 0.267
* French Polynesia (n=10010): 0.262
* Mozambique (n=39652): 0.257
* Hong Kong (n=108513): 0.249
* Palestine (n=50793): 0.241
* Japan (n=1831751): 0.226
* Netherlands (n=513409): 0.224
* Pakistan (n=190911): 0.220
* South Korea (n=431625): 0.212
* Iraq (n=92475): 0.211
* Rwanda (n=24673): 0.210
* Papua New Guinea (n=34147): 0.208
* Ireland (n=216665): 0.198
* Puerto Rico (n=30474): 0.192
* Georgia (n=117464): 0.191
* United States (n=7663219): 0.186
* India (n=1173213): 0.184
* Vietnam (n=154476): 0.180
* North Korea (n=53224): 0.180
* Zambia (n=34238): 0.178
* Myanmar (n=76307): 0.176
* Norway (n=857431): 0.174
* Bonaire, Sint Eustatius, and Saba (n=2011): 0.169
* Malaysia (n=171801): 0.164
* Czech Republic (n=493230): 0.162
* Greece (n=306967): 0.162
* Macao (n=8349): 0.157
* Poland (n=1160628): 0.157
* Germany (n=2485701): 0.157
* Israel (n=257611): 0.156
* Mongolia (n=40270): 0.151
* Belgium (n=396716): 0.150
* Venezuela (n=108629): 0.149
* Sweden (n=746365): 0.149
* Egypt (n=175477): 0.148
* East Timor (n=18236): 0.147
* Denmark (n=322453): 0.147
* Uzbekistan (n=65308): 0.146
* South Africa (n=223438): 0.145
* Slovakia (n=208130): 0.141
* Finland (n=473536): 0.139
* Réunion (n=8145): 0.137
* Australia (n=675264): 0.135
* Canada (n=1038515): 0.134
* Turkey (n=585646): 0.134
* Kosovo (n=32826): 0.133
* Ukraine (n=718982): 0.128
* New Caledonia (n=11472): 0.128
* Afghanistan (n=102346): 0.128
* Bosnia and Herzegovina (n=153326): 0.128
* North Macedonia (n=95044): 0.127
* Peru (n=185711): 0.126
* Curaçao (n=5249): 0.126
* Sudan (n=37684): 0.123
* Austria (n=472502): 0.123
* Faroe Islands (n=17032): 0.122
* Saudi Arabia (n=79378): 0.122
* Belarus (n=314330): 0.121
* Nepal (n=84563): 0.121
* Jamaica (n=41112): 0.121
* Estonia (n=206831): 0.119
* Guadeloupe (n=6345): 0.118
* Wallis and Futuna (n=1511): 0.117
* Portugal (n=270499): 0.116
* Italy (n=1782935): 0.115
* Jersey (n=3967): 0.114
* Tajikistan (n=40029): 0.113
* Azerbaijan (n=167402): 0.112
* French Guiana (n=6980): 0.112
* Hungary (n=357849): 0.111
* Switzerland (n=459430): 0.111
* Croatia (n=252385): 0.110
* Serbia (n=258798): 0.110
* Luxembourg (n=58242): 0.107
* Colombia (n=174754): 0.107
* Northern Mariana Islands (n=2398): 0.106
* Indonesia (n=682673): 0.106
* Kazakhstan (n=180745): 0.105
* Brazil (n=676784): 0.105
* Russia (n=1869024): 0.103
* New Zealand (n=239703): 0.103
* Bangladesh (n=99091): 0.103
* Cambodia (n=33453): 0.102
* Cyprus (n=46417): 0.101
* Romania (n=411740): 0.101
* Spain (n=1437708): 0.099
* Philippines (n=175378): 0.099
* Greenland (n=24245): 0.098
* Chile (n=174890): 0.098
* Iran (n=686007): 0.098
* Nigeria (n=116372): 0.097
* Iceland (n=89172): 0.096
* Democratic Republic of the Congo (n=70011): 0.096
* Zimbabwe (n=35703): 0.094
* Latvia (n=118147): 0.094
* Armenia (n=122873): 0.094
* Lebanon (n=69408): 0.094
* Cook Islands (n=4037): 0.093
* Argentina (n=317425): 0.093
* Syria (n=90662): 0.092
* Slovenia (n=184418): 0.091
* Bulgaria (n=221694): 0.090
* Sri Lanka (n=69977): 0.089
* Madagascar (n=61738): 0.089
* Saint Helena, Ascension, and Tristan da Cunha (n=2329): 0.088
* Bolivia (n=69820): 0.088
* Montenegro (n=52753): 0.086
* Suriname (n=19837): 0.086
* Kenya (n=62480): 0.084
* Kyrgyzstan (n=45266): 0.084
* Angola (n=62117): 0.082
* Ethiopia (n=51519): 0.082
* French Southern and Antarctic Lands (n=2342): 0.081
* Dominican Republic (n=43399): 0.081
* Guatemala (n=37976): 0.080
* Moldova (n=53961): 0.079
* Saint Pierre and Miquelon (n=1340): 0.079
* Lithuania (n=200725): 0.077
* Malawi (n=17300): 0.077
* Panama (n=38327): 0.077
* Guam (n=4117): 0.077
* Tanzania (n=53827): 0.077
* Belize (n=13861): 0.077
* Chad (n=20977): 0.076
* Somalia (n=29557): 0.075
* Solomon Islands (n=15272): 0.075
* Sint Maarten (n=1684): 0.075
* Qatar (n=30621): 0.072
* Libya (n=36258): 0.072
* Central African Republic (n=21536): 0.072
* Liechtenstein (n=14432): 0.071
* Mexico (n=886800): 0.070
* Martinique (n=5549): 0.068
* Bahamas (n=11716): 0.067
* United Arab Emirates (n=42510): 0.067
* Fiji (n=28387): 0.066
* Cuba (n=80154): 0.066
* Aruba (n=4824): 0.065
* Isle of Man (n=6895): 0.064
* Åland (n=6770): 0.063
* Guernsey (n=4732): 0.063
* Tunisia (n=68053): 0.063
* Maldives (n=11794): 0.063
* Antarctica (n=114809): 0.063
* Uruguay (n=77602): 0.062
* Malta (n=30982): 0.062
* Federated States of Micronesia (n=7429): 0.062
* Cameroon (n=55648): 0.062
* France (n=3082505): 0.062
* Vanuatu (n=11305): 0.061
* Singapore (n=45465): 0.061
* Ecuador (n=69581): 0.060
* Ghana (n=53856): 0.059
* Falkland Islands (n=3931): 0.059
* Monaco (n=20429): 0.059
* Haiti (n=30331): 0.058
* South Sudan (n=17302): 0.058
* Niger (n=22993): 0.058
* British Virgin Islands (n=2719): 0.057
* Laos (n=26128): 0.057
* Paraguay (n=37136): 0.057
* Republic of the Congo (n=17755): 0.056
* Senegal (n=33470): 0.056
* Algeria (n=135612): 0.056
* San Marino (n=13572): 0.056
* American Samoa (n=4220): 0.055
* Kuwait (n=20562): 0.055
* Honduras (n=32323): 0.054
* Mayotte (n=1937): 0.054
* Mali (n=35458): 0.053
* Turkmenistan (n=20220): 0.053
* Tokelau (n=929): 0.053
* Jordan (n=37198): 0.053
* Djibouti (n=12545): 0.052
* Namibia (n=25727): 0.052
* Marshall Islands (n=6045): 0.052
* Uganda (n=40580): 0.051
* Guinea (n=21210): 0.051
* Guyana (n=14314): 0.051
* El Salvador (n=23430): 0.050
* Costa Rica (n=37113): 0.049
* Burkina Faso (n=37275): 0.049
* Albania (n=88489): 0.048
* Sierra Leone (n=14063): 0.048
* Burundi (n=19166): 0.047
* United States Virgin Islands (n=4997): 0.044
* Ivory Coast (n=38928): 0.044
* Gibraltar (n=6230): 0.044
* Cape Verde (n=14811): 0.044
* Benin (n=21554): 0.044
* Yemen (n=109190): 0.043
* Bermuda (n=5582): 0.043
* Morocco (n=182282): 0.043
* Trinidad and Tobago (n=18836): 0.043
* Andorra (n=18232): 0.043
* Mauritius (n=12645): 0.042
* Oman (n=23389): 0.042
* São Tomé and Príncipe (n=7532): 0.040
* Gambia (n=11499): 0.039
* Seychelles (n=8461): 0.039
* Pitcairn Islands (n=1367): 0.038
* Tonga (n=9552): 0.038
* Kiribati (n=6381): 0.036
* South Georgia and the South Sandwich Islands (n=5650): 0.036
* Nicaragua (n=22673): 0.035
* Liberia (n=15047): 0.035
* Bouvet Island (n=495): 0.034
* Samoa (n=11049): 0.034
* Brunei (n=11811): 0.034
* Palau (n=6284): 0.034
* Guinea-Bissau (n=11125): 0.033
* Bhutan (n=15877): 0.032
* Gabon (n=17376): 0.032
* Eritrea (n=16193): 0.032
* Mauritania (n=18744): 0.032
* Anguilla (n=2095): 0.032
* Comoros (n=8450): 0.031
* Western Sahara (n=5635): 0.031
* Botswana (n=19557): 0.029
* Togo (n=14225): 0.029
* Cayman Islands (n=2545): 0.028
* Equatorial Guinea (n=11861): 0.027
* Montserrat (n=1900): 0.026
* Eswatini (n=8506): 0.026
* Barbados (n=9475): 0.025
* Grenada (n=6771): 0.024
* Lesotho (n=9031): 0.024
* Bahrain (n=17556): 0.024
* Turks and Caicos Islands (n=1713): 0.023
* Dominica (n=5993): 0.022
* Saint Martin (n=1076): 0.021
* Norfolk Island (n=1267): 0.021
* Tuvalu (n=4846): 0.020
* British Indian Ocean Territory (n=667): 0.019
* Niue (n=1954): 0.019
* Antigua and Barbuda (n=7524): 0.017
* Saint Barthélemy (n=742): 0.015
* Christmas Island (n=687): 0.015
* Nauru (n=4697): 0.014
* Saint Kitts and Nevis (n=5674): 0.014
* Saint Vincent and the Grenadines (n=5601): 0.012
* Saint Lucia (n=6662): 0.012
* Cocos (Keeling) Islands (n=406): 0.002
* Heard Island and McDonald Islands (n=156): 0.000
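A rough sketch of how that focused sample could be drawn from the predictions dump; the file name and column names are assumptions, not the dump's actual schema:

```
# Draw ~10 wikilink-only predictions per country for every country where >=20%
# of its predictions come from wikilinks alone. Column names are assumptions.
import pandas as pd

df = pd.read_csv("country_predictions.tsv.gz", sep="\t")
df["links_only"] = df["from_links"] & ~df["from_wikidata"] & ~df["from_categories"]

prop_links_only = df.groupby("country")["links_only"].mean()
focus_countries = prop_links_only[prop_links_only >= 0.20].index  # down to Papua New Guinea

sample = (df[df["country"].isin(focus_countries) & df["links_only"]]
            .groupby("country", group_keys=False)
            .apply(lambda g: g.sample(min(10, len(g)), random_state=0)))
print(sample[["wiki", "title", "country"]])
```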

Full evaluation complete! Results can be found here: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/article-country-model/qual-code-stats.ipynb

Summary:

  • On a random sample of 100 Wikipedia articles, we had 0.975 precision and 0.772 recall. Which is to say, the model didn't get every country it should have, but it made only two incorrect predictions. Both were for the UK (one because an American band played music that originated in part in the UK, and one because a chess move was written about by a bunch of British people). So this far exceeds the 70% precision and 50% recall metrics we set as a baseline for this hypothesis (victory)!
  • On a second random sample of 50 articles just from English Wikipedia, we saw similarly high numbers: 0.950 precision and 0.691 recall. Two incorrect predictions again -- one was Germany for an article about a tractor made by a German company (so I'd say wrong but not unreasonable) and one was the US for a general theatre term that happened to have a bunch of US-based examples in it. Again, I'm very happy with this result and it suggests very low levels of bad user surprise if we deploy the model.
  • Then we took a deeper look at a subset of 170 articles where I was seeing less confident predictions and wanted to dig further. These are not at all representative of the whole -- it's mainly a "worst-case" situation. For these, we still had a precision of 0.675 and a recall of 0.829, so really not bad at all. But it pointed to a few areas to dig deeper into in the long term. For example, the US Minor Outlying Islands were getting predicted for articles about US states because some languages seem to add a navbox with links to every US state and territory (including the outlying islands). This threw off the model and suggests that filtering some of these template-based links might be helpful. Vatican City was being applied to many Roman Catholic priests, even those posted in other countries. This is maybe reasonable but not what I fully expected. I'll be digging into a few other countries where we see this sort of thing happen a bit more too.

I'll likely wait a week but this essentially closes out this task.

Final report made:

Outcome: The hypothesis was supported. The model achieves an overall precision of 97% and recall of 77%, which exceeds the 70% precision and 50% recall baseline that was set.
Overview: Over the course of this hypothesis, the model was improved from a single source of data (Wikidata) to three sources (+categories, +links), which resulted in a substantive increase in recall (7% overall but much higher in certain wikis / countries — e.g., +25% increase in English Wikipedia) without any major drop in precision. Code was written to make the model functional as an API so that it can next be hosted by the ML Platform on LiftWing and formally evaluated for accuracy.
Relevant documents:
    Documentation: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Countries
    Training code: https://github.com/geohci/wiki-region-groundtruth/tree/main/notebooks
    API code: https://github.com/wikimedia/research-api-endpoint-template/blob/region-api/model/wsgi.py
    Final evaluation: https://public-paws.wmcloud.org/User:Isaac%20(WMF)/article-country-model/qual-code-stats.ipynb
Major lessons:
    Working with specific sub-communities where we see lower performance can help in identifying easy ways to improve the model – it was a focus on species in the model that led to identifying categories as a useful signal: https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Countries/Species
    Links introduced via templates continue to be a sometimes useful but sometimes problematic source of data (in this case, links in navboxes or references are one major source of incorrect predictions by the model). The pagelinks table/API lacks this context (though it is a very low-latency way of accessing the links in an article), but future work might investigate the trade-offs of extracting links via the article HTML instead (more control but higher latency); a small illustration of the two approaches is sketched at the end of this report.
Potential next steps:
    Host on LiftWing (ML Platform): https://phabricator.wikimedia.org/T371897
    Formalize training code (Research): https://phabricator.wikimedia.org/T371900
    Incorporate predictions into Search Index (Search Platform): https://phabricator.wikimedia.org/T301671
    Then LPL can incorporate into the topic suggestions on Content Translation: https://phabricator.wikimedia.org/T113257
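As a small illustration of that pagelinks-vs-HTML trade-off (endpoint paths, CSS class names, and the example title here are assumptions about current Wikimedia conventions, not the model's actual code):

```
# Option 1 is fast but context-free; option 2 is slower but lets us drop links
# that only appear in navboxes or reference lists. Class names are assumptions.
import requests
from bs4 import BeautifulSoup

TITLE = "Douglas_Adams"  # example title

# Option 1: links via the action API (pagelinks view; no info on where a link occurs)
r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query", "prop": "links", "titles": TITLE,
    "pllimit": "max", "format": "json"})
page = next(iter(r.json()["query"]["pages"].values()))
flat_links = {l["title"] for l in page.get("links", [])}

# Option 2: Parsoid HTML (higher latency, but template/reference link farms can be stripped)
html = requests.get(f"https://en.wikipedia.org/api/rest_v1/page/html/{TITLE}").text
soup = BeautifulSoup(html, "html.parser")
for node in soup.select(".navbox, .mw-references, .references"):
    node.decompose()  # drop navboxes and reference lists before collecting links
body_links = {a["href"].lstrip("./") for a in soup.select('a[rel~="mw:WikiLink"]')}

print(len(flat_links), "links from pagelinks;", len(body_links), "after HTML filtering")
```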