
Outlinks model card
Closed, Resolved (Public)

Description

While we are waiting for the ml-serve cluster to go live (see: T287056), let's try making a first pass at a model card and/or datasheet.

We should aggregate all data here that fits into the following:

office hours video recording on model cards

  • Model title
  • Model description
  • Model created date
  • Model version

Description

  • Intended uses
  • Intended users
  • Out-of-scope*

Model Timeline

  • explanation of major changes
  • metrics
  • [what do you call all the versions of a model]

Training data

  • License

Interpretability Guide

Contact Us*

Difficulty tag to reproduce (as a barrier to entry)

Event Timeline

@ACraze my first pass at some of these for outlinks-based topic classification model:

== Basic Details
– Title: Outlinks-based Wikipedia Topic Classification
– Description: this model uses the wikilinks in a Wikipedia article to predict which of a set of 64 topics (zero to many) are relevant to it -- e.g., Mt. Everest might reasonably be associated with South Asia, East Asia, Sports, and Earth and the Environment.
– Type: fastText supervised multi-label classification
– Created date: January 2021
– Version: Prototype
– Paper: https://arxiv.org/abs/2103.00068
– Code License: MIT
– Model License: CC0 (?)
– Citation Details: ??
– Contact: ??
– Primary Developer: Wikimedia Foundation

== Description
– Intended uses: high-level analyses of Wikipedia dynamics such as pageview, article quality, or edit trends; filtering to relevant articles; cross-language comparisons
– Intended users: researchers, bots, editors, user scripts / gadgets, ??
– Out-of-scope: projects outside of Wikipedia, namespaces outside of 0, disambiguation pages, redirects.

== Factors
– Relevant factors: performance will suffer for articles with few outlinks, though generally this means precision remains high while recall drops. Atypical linking practices might also lead to hard-to-explain results, though "atypical" is difficult to define precisely. In practice, the number of outlinks in a given Wikipedia article varies by language edition, region of the world, and age of the article.
– Evaluation factors: number of outlinks, topic

== Metrics
– Model performance measures: precision (threshold=0.5), recall (threshold=0.5), F1 (threshold=0.5), average-precision (threshold-agnostic)
– Decision thresholds: 0.5 for all labels
– Variation approaches: None (though experimentation has indicated the model is highly stable across runs and changes in hyperparameters)
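The measures above can be sketched in plain Python. This is an illustrative helper, not part of the model code: the function name and the `y_true`/`y_score` data shapes are assumptions, and it computes micro-averaged precision/recall/F1 at the stated 0.5 threshold.

```python
def precision_recall_f1(y_true, y_score, threshold=0.5):
    """Micro-averaged precision/recall/F1 for multi-label predictions.

    y_true:  list of sets of gold topic labels, one set per article
    y_score: list of dicts mapping topic label -> model confidence
    """
    tp = fp = fn = 0
    for gold, scores in zip(y_true, y_score):
        # A topic counts as predicted when its score meets the threshold.
        pred = {label for label, s in scores.items() if s >= threshold}
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1
```

Average precision, being threshold-agnostic, would instead sweep the threshold over all observed scores (or use something like scikit-learn's `average_precision_score`).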

== Architecture
– Labels: https://www.mediawiki.org/wiki/ORES/Articletopic#Taxonomy
– Ground truth: based on English Wikipedia's WikiProject tags, as mapped onto the 64-topic taxonomy by the community and researchers. Labels for English Wikipedia articles were then projected onto all other language versions of those articles. These labels covered 55% of all articles on Wikipedia.
– Prediction coverage: the model is language-agnostic -- i.e., for any given language edition, it maps an article's links to Wikidata items and can thus make predictions for any wiki without additional preprocessing or training, though performance will vary by wiki.
– Training data: a 90% sample of every language edition of Wikipedia. In practice, this means English Wikipedia provides 11.4% of the data, followed by Cebuano (8.8%), Swedish (6.4%), German (4.6%), and French (4.2%); all other languages are below 4%. Sampling is done by Wikidata ID, so all language versions of a given article appear in exactly one of the training, validation, or test splits.
– Test data: an 8% sample of every language edition (the remaining 2% is retained for validation).
– Parameters: 2 epochs; 0.1 learning rate; window size 20; minimum count 20; no pre-trained embeddings; 50-dimensional embeddings; vocabulary size 4,145,064; 3,200 classifier parameters and 207,253,200 embedding parameters; 863 MB on disk.
– Pipeline: wikilinks in an article (to other namespace-0 articles in that wiki) are mapped to their corresponding Wikidata IDs. Links with no Wikidata ID, or whose Wikidata ID is not in the vocabulary, are dropped. The resulting bag of Wikidata IDs is fed into the model, which maps each ID to a 50-dimensional embedding, averages them together, and then uses multinomial logistic regression to predict labels.
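The scoring step of that pipeline can be sketched as follows. This is a toy stand-in, not the real fastText internals: the function name, the tiny embedding table, and the label-weight vectors are all invented for illustration, and a per-label sigmoid (one-vs-all) is used here since the model predicts zero-to-many topics; real inference would load the trained model via the `fasttext` package instead.

```python
import math


def predict_topics(outlink_qids, embeddings, label_weights, threshold=0.5):
    """Score topics for one article from its bag of outlink Wikidata IDs.

    outlink_qids:  Wikidata IDs of the article's namespace-0 outlinks
    embeddings:    dict QID -> embedding vector (50-dim in the real model)
    label_weights: dict topic label -> weight vector of the linear layer
    """
    # Drop links whose QID is out of vocabulary, as the real pipeline does.
    vectors = [embeddings[q] for q in outlink_qids if q in embeddings]
    if not vectors:
        return {}
    dim = len(vectors[0])
    # Average the outlink embeddings into a single document vector.
    doc = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    # One confidence score per topic; labels at/above the threshold are kept.
    scores = {}
    for label, w in label_weights.items():
        z = sum(wi * di for wi, di in zip(w, doc))
        scores[label] = 1.0 / (1.0 + math.exp(-z))
    return {label: s for label, s in scores.items() if s >= threshold}
```

If the listed hyperparameters map onto the fastText Python API as I'd expect, training would look roughly like `fasttext.train_supervised(input=..., epoch=2, lr=0.1, ws=20, minCount=20, dim=50)`, though that exact call is an assumption, not something documented here.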

== Quantitative Analyses
– In theory, could produce any mixture of: topic label, wiki language, region of world (country, sub-continent, continent, Global N/S), gender, # of outlinks, age of article
– Given the limitation that labels are only available for articles with English Wikipedia equivalents, hand-labeling of articles in several languages (Arabic (ar), Czech (cs), English (en), French (fr), and Vietnamese (vi)) was conducted by expert Wikipedians. It largely validated the performance of the models while also clearly identifying some topic-specific caveats. See https://phabricator.wikimedia.org/T266201#6864397 for more information.

== Ethical Considerations
– This taxonomy was developed initially as a guide for discovering English WikiProjects. While some tweaks have been made to align it better with topic classification, it likely reflects English Wikipedia's interests and distinctions. Other language editions presumably would make different distinctions.

== Caveats and Recommendations
– While 0.5 is the suggested threshold, other thresholds may be more appropriate depending on the language and topic label. Notably, the raw scores from the model are not a measure of topical relevance but a measure of model confidence that a topic is relevant. Thus, a higher score does not mean a topic is more relevant, and topics with clearer relevance -- e.g., geography, biographies -- will generally have higher scores than topics with more ambiguous relevance -- e.g., society, education.
– Gaps in WikiProject coverage are known to lead to biases in recall for certain topics. For example, film labels are largely missing actors from Nollywood (Nigeria), and thus recall is lower for articles about Nigerian films and actors than for Hollywood (US) films and actors.

OH wow @Isaac this is great! We are currently discussing if Wikitech or MediaWiki.org is a better spot for the model cards. And @Htriedman is working on a model card. I'd love to hear people's thoughts on this.

@calbon thanks for the details! Yeah, I figured it'd be useful to try to summarize what I knew into a single place because otherwise it's split across the paper (https://arxiv.org/abs/2103.00068), meta (https://meta.wikimedia.org/wiki/Research:Language-Agnostic_Topic_Classification/Outlink_model_performance), some email threads, and my brain. No strong opinions about card design / location or urgency on my part. Just let me know what I can do to fill in gaps, brainstorm, etc.

That is really the value we are going for. And if it is on a wiki, it opens up community contributions to the page and discussions via a talk page.

@calbon sounds good — I'm also in the middle of putting the information spread across all the outlinks-model-related pages into my model card content v0.2 doc. That document is on Google Docs, but should be publicly accessible at this link. Looking forward to talking about it next week!