Diego shared a dataset with topics for (almost) every article in all Wikipedias, based on the wiki-topics tool (by User:Isaac_(WMF)) and using the Draft_topic taxonomy (by User:Halfak_(WMF)).
This dataset contains the predicted topic(s) for each Wikipedia article with a Wikidata item across languages.
- Schema for this dataset
| Column Name | Description |
| --- | --- |
| Qid | Wikidata item ID |
| topic | Topic based on the ORES draft topic model (https://www.mediawiki.org/wiki/Talk:ORES/Draft_topic) |
| probability | Predicted probability that the article belongs to the topic |
| page_id | Page ID of the article |
| page_title | Title of the article |
| wiki_db | Wiki database name, for example enwiki for English Wikipedia |
- Taxonomy used in the dataset. The text in green marks the topics (for example, line 50 contains the topic Food & Drink)
- Test on model performance
- Wikidata coverage for Wikipedia articles (excluding redirect pages): 99.7% in English Wikipedia.
We want to review this dataset to see whether it could be used to create a topic dimension in the content dataset for articles.
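As a rough starting point for that review, the sketch below shows one way the schema above could be turned into a per-article topic list. It is only an illustration under several assumptions: the file format (tab-separated with a header row), the 0.5 probability threshold, the helper name `topics_per_article`, and all sample rows (including the placeholder topics `Topic_B` and `Topic_C`) are hypothetical and not taken from the dataset itself.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows that follow the schema columns; all values are
# made up for illustration and do not come from the real dataset.
SAMPLE_TSV = """Qid\ttopic\tprobability\tpage_id\tpage_title\twiki_db
Q1\tFood & Drink\t0.91\t101\tExample_A\tenwiki
Q1\tTopic_B\t0.35\t101\tExample_A\tenwiki
Q1\tFood & Drink\t0.88\t202\tEjemplo_A\teswiki
Q2\tTopic_B\t0.62\t303\tExample_B\tenwiki
Q2\tTopic_C\t0.57\t303\tExample_B\tenwiki
"""


def topics_per_article(rows, threshold=0.5):
    """Map (wiki_db, page_id) to its topics with probability >= threshold.

    Topics are returned as (topic, probability) pairs, highest probability
    first. The threshold is an assumed value, not one defined by the dataset.
    """
    result = defaultdict(list)
    for row in rows:
        probability = float(row["probability"])
        if probability >= threshold:
            key = (row["wiki_db"], int(row["page_id"]))
            result[key].append((row["topic"], probability))
    for topics in result.values():
        topics.sort(key=lambda pair: pair[1], reverse=True)
    return dict(result)


# Parse the in-memory sample the same way a real TSV export could be parsed.
rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
article_topics = topics_per_article(rows, threshold=0.5)

for key, topics in sorted(article_topics.items()):
    print(key, topics)
```

Keying on (wiki_db, page_id) rather than Qid keeps one entry per article in each language, which seems closer to what an article-level topic dimension would need. Whether to keep every topic above a threshold or only the top one is a design choice to settle during the review.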