What use case is the model going to support/resolve?
Expanding the region component of the articletopic taxonomy to be country-specific. This work is happening as part of the WE2.1.1 hypothesis in the FY 24-25 Annual Plan.
Note that I'm currently thinking of this as a stand-alone model as opposed to an adjustment to the articletopic model. I'm open to discussing whether we want to merge the two, though because they operate in different ways, I assume it's smarter to keep them modular and separate.
Do you have a model card?
Yes: https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Article_country
What team created/trained/etc. the model? What tools and frameworks have you used?
Research -- this is a very basic rule-based model at this stage, so almost all of the dependencies take the form of API calls or pre-computed data that the model needs access to. The one exception is the shapely Python library, which is used to determine whether a given lat-lon point falls within a country.
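For illustration, a minimal sketch of that shapely check, assuming country polygons are loaded from the GeoJSON dependency (the file name and property keys here are placeholders, not the actual ones used by the model):

```python
import json

from shapely.geometry import Point, shape

# Pre-build (country name, polygon) pairs once at start-up.
# "countries.geojson" and the "name" property are illustrative.
with open("countries.geojson") as fin:
    features = json.load(fin)["features"]
COUNTRY_SHAPES = [(f["properties"]["name"], shape(f["geometry"])) for f in features]

def point_to_country(lat: float, lon: float):
    """Return the first country whose geometry contains the point, if any."""
    pt = Point(lon, lat)  # shapely uses (x, y) == (lon, lat) order
    for name, geom in COUNTRY_SHAPES:
        if geom.contains(pt):
            return name
    return None
```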
What kind of data was the model trained with, and what kind of data is the model going to need in production (for example, calls to internal/external services, special data sources for features, etc.)?
For an input Wikipedia article, the following API calls are needed:
- Single call to Wikibase API for Wikidata properties (wbgetentities)
- Single call to Mediawiki API for categories (categories)
- Single call to Mediawiki API for pagelinks (links)
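For illustration, a rough sketch of those three calls against the public Action APIs (the title, QID, wiki, and exact parameters are placeholders; the production code may batch or filter differently):

```python
import requests

S = requests.Session()
WIKI_API = "https://en.wikipedia.org/w/api.php"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"

# Wikidata properties for the article's item (wbgetentities)
wd = S.get(WIKIDATA_API, params={
    "action": "wbgetentities", "ids": "Q23", "props": "claims", "format": "json",
}).json()

# Categories for the article (prop=categories)
cats = S.get(WIKI_API, params={
    "action": "query", "titles": "George Washington", "prop": "categories",
    "cllimit": "max", "format": "json",
}).json()

# Wikilinks for the article (prop=links)
links = S.get(WIKI_API, params={
    "action": "query", "titles": "George Washington", "prop": "links",
    "pllimit": "max", "format": "json",
}).json()
```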
To transform these API results into predictions, the following data-dependencies are required (outside of a few hard-coded parameters in the code):
- GeoJSON of country geometries for mapping a Wikidata lat-lon coordinate to a country (25 MB)
- TSV of categories for mapping a category to a country (2 MB)
- The actual list of countries (54 KB)
As you can see, the footprint is small. The main challenge is just having a good way of updating these files. The GeoJSON and list of countries should be pretty static, but the TSV of categories is something that ideally would be updated on a regular cadence (open to discussion about what's feasible). Not listed above, but something that should probably be separated out (as opposed to hard-coded), is a simple dictionary of tf-idf transformation values for each of the ~250 countries. This is pretty static but should also probably be refreshed occasionally.
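As a rough sketch of how the category mapping might be applied, assuming a simple two-column TSV of category title and country (the real file layout may differ):

```python
import csv

# Assumed layout: one row per (category, country) pair; file name is a placeholder.
CATEGORY_TO_COUNTRY = {}
with open("category_countries.tsv", newline="") as fin:
    for category, country in csv.reader(fin, delimiter="\t"):
        CATEGORY_TO_COUNTRY.setdefault(category, set()).add(country)

def countries_from_categories(categories):
    """Union of countries associated with any of the article's categories."""
    found = set()
    for cat in categories:
        found.update(CATEGORY_TO_COUNTRY.get(cat, ()))
    return found
```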
If you have a minimal codebase that you used to run the first tests with the model, could you please share it?
https://github.com/wikimedia/research-api-endpoint-template/blob/region-api/model/wsgi.py
There is one major change between the above API code and how I expect it would work on LiftWing. The model has a step where it gathers all the wikilinks in the article (as represented by their Wikidata IDs) and then maps them to whatever countries they are associated with, in order to determine whether any countries are prevalent enough in the links to be elevated to a prediction. This mapping of link QIDs -> countries requires having the model's ground truth available for all Wikipedia articles as a fast look-up. Otherwise, to make a prediction for a single article, the model might first have to generate predictions for, e.g., the 50 other articles it links to (which obviously is not feasible). In the API above, I have solved that by having a simple SQLite database of all the articles and their predicted countries that I use for this wikilink inference stage. That's only 715MB, so not an awful dependency, but large enough to not be ideal.

In the LiftWing API, I was envisioning depending on the Search API for this purpose instead. The goal is to have a pipeline similar to articletopic that loads the country predictions into the Search index. Then, with a single call to the Mediawiki API, we can gather these predictions for all of a page's links and use those instead of the static database dependency (example API call). This would require some collaboration with Search to do an initial loading of the Search index, decide on the tag name that we're going to use, etc., but it would greatly simplify the LiftWing component.
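For illustration, a minimal sketch of the current SQLite-based wikilink inference step, assuming a table like countries(qid, country) and an illustrative prevalence threshold (the real schema and threshold will differ):

```python
import sqlite3
from collections import Counter

# "article_countries.sqlite" and the countries(qid, country) schema are assumptions.
conn = sqlite3.connect("article_countries.sqlite")

def countries_from_links(link_qids, min_prop=0.25):
    """Count the countries already predicted for an article's links and keep
    any country that is prevalent enough among them (threshold is illustrative)."""
    counts = Counter()
    for qid in link_qids:
        rows = conn.execute("SELECT country FROM countries WHERE qid = ?", (qid,))
        counts.update(country for (country,) in rows)
    total = len(link_qids) or 1
    return [c for c, n in counts.items() if n / total >= min_prop]
```

On LiftWing, the same per-link counts would come from a single Mediawiki API call against the Search index tags rather than from this static database.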
State what team will own the model and please share the main points of contact (see more info in '''Ownership of a model''').
Research
What is the current latency and throughput of the model, if you have tested it?
Fast, and I haven't done any optimization yet. The three separate API calls listed above could be made in parallel, which would likely help a bit (see the sketch below).
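For example, a rough sketch of running the three calls concurrently with asyncio/aiohttp (same placeholder title, QID, and parameters as the sketch above; aiohttp is an assumption, not necessarily what LiftWing would use):

```python
import asyncio

import aiohttp

WIKI_API = "https://en.wikipedia.org/w/api.php"
WIKIDATA_API = "https://www.wikidata.org/w/api.php"

async def fetch(session, url, params):
    async with session.get(url, params=params) as resp:
        return await resp.json()

async def gather_inputs(title, qid):
    """Issue the Wikidata-properties, categories, and links calls concurrently."""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            fetch(session, WIKIDATA_API,
                  {"action": "wbgetentities", "ids": qid, "props": "claims", "format": "json"}),
            fetch(session, WIKI_API,
                  {"action": "query", "titles": title, "prop": "categories", "cllimit": "max", "format": "json"}),
            fetch(session, WIKI_API,
                  {"action": "query", "titles": title, "prop": "links", "pllimit": "max", "format": "json"}),
        )

# Example: asyncio.run(gather_inputs("George Washington", "Q23"))
```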
Is there an expected frequency in which the model will have to be retrained with new data? What are the resources required to train the model and what was the dataset size?
Retraining really is just refreshing the two core data dependencies on occasion. I'm open to discussion, but even every six months or every year would be fine:
- Category -> country mapping. This is something I can work with Research Engineering to build an Airflow job for recalculating.
- Country tf-idf values. Same as above -- just a simple data job that I can work with Research Engineering to turn into an Airflow job.
Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive for somebody? Even if you have any slight worry or corner case, please tell us!
More details are in the model card, but here is a quick summary. The model combines three signals. Two of these (Wikidata properties and Wikipedia categories) are very "safe" in that they just pass along decisions made by Wikimedians, so the model doesn't add any further risk of harm. The model also infers some countries from the wikilinks, which does risk false positives that might be objectionable to some. To take a "trivial" case, the French article for meatball suggests that Tunisia is the associated country because of a Tunisian Cuisine navigation box with a ton of links at the bottom of the article. But in all these cases, we can at least point to which links led to the inference, so again I think the risk of offense is quite low. And the intended use of this model is as a filter, so we won't be saying, e.g., that "meatballs are Tunisian"; instead, if someone requested a list of foods filtered to Tunisia, then meatball might show up on that list.
The other challenge is what counts as a "country". For this, we're using our official internal countries list, which is based on ISO codes, so at least it's defensible even if some might wish it were slightly different.
Everything else that is relevant in your opinion.
- The model isn't quite ready for deployment -- I'm working on evaluation at the moment. But I'm creating this task in the hope of opening up discussion on whether you all see any potential issues with hosting, so we can address them early, before committing to a specific approach. Ideally we would work on deployment towards the end of Q1 (late September).
- I would also like a stream for this model so that the predictions can be incorporated into the Search index. I probably lean towards adjusting the existing articletopic stream to also call this new model and then merging the predictions, if that's possible. Otherwise it could be a separate stream.
- Right now, the model outputs just the country name -- e.g., American Samoa. This is different from the articletopic models, which output a hierarchy of names. The equivalent for this model would be, e.g., Oceania.Polynesia.American Samoa. This would be a very minor change, as there's a direct mapping between country names and their full continent.subcontinent.country names (sketched below), but I'm not sure at this point which is preferred by the end users.
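For illustration only, that mapping could be as simple as the following (the lookup values here are examples, not the actual table):

```python
# Illustrative mapping from flat country names to articletopic-style
# continent.subcontinent.country labels; entries are examples only.
COUNTRY_HIERARCHY = {
    "American Samoa": ("Oceania", "Polynesia"),
    "Tunisia": ("Africa", "Northern Africa"),
}

def to_hierarchical_label(country):
    continent, subcontinent = COUNTRY_HIERARCHY[country]
    return f"{continent}.{subcontinent}.{country}"

# to_hierarchical_label("American Samoa") -> "Oceania.Polynesia.American Samoa"
```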