This dataset will have a set of mid-level WikiProject categories associated with pages.
Part 1 - Mid-level categories
E.g. Albert Einstein is tagged by lots of WikiProjects (Biography, Germany, Switzerland, History of Science, Jewish history, United States, Socialism, Mathematics, New Jersey, Philosophy, Physics, Astronomy, Religion). These WikiProjects fall into the following mid-level categories (based on the WikiProject Directory):
- culture.biography
- geography.countries
- geography.americas
- geography.europe
- history_and_society.politics_and_government
- stem.mathematics
- stem.physics
We'll want to be able to classify articles ~an hour after creation, so each record should probably also include the ID of the last revision saved within that first hour.
So a record for this observation would look like this:
{ "page_id": 736, "rev_id": 234036, "wp_categories": [ "culture.biography", "geography.countries", "geography.americas", "geography.europe", "history_and_society.politics_and_government", "stem.mathematics", "stem.physics" ] }
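As a rough sketch of how such a record could be assembled: the WikiProject-to-category mapping below is a hypothetical, partial stand-in for whatever gets derived from the WikiProject Directory, and the function names, the (rev_id, timestamp) input shape, and the example timestamps are all assumptions made for illustration.

```python
import json
from datetime import datetime, timedelta

# HYPOTHETICAL, partial mapping from WikiProject name to mid-level categories.
# The real mapping would come from the WikiProject Directory.
WIKIPROJECT_TO_MID_LEVEL = {
    "Biography": ["culture.biography"],
    "Germany": ["geography.countries", "geography.europe"],
    "United States": ["geography.countries", "geography.americas"],
    "Socialism": ["history_and_society.politics_and_government"],
    "Mathematics": ["stem.mathematics"],
    "Physics": ["stem.physics"],
}


def last_revision_within(revisions, window=timedelta(hours=1)):
    """Return the rev_id of the last revision saved within `window` of creation.

    `revisions` is a non-empty list of (rev_id, timestamp) tuples; the
    earliest timestamp is treated as the page creation time.
    """
    ordered = sorted(revisions, key=lambda rev: rev[1])
    created = ordered[0][1]
    eligible = [rev_id for rev_id, ts in ordered if ts - created <= window]
    return eligible[-1]


def build_record(page_id, revisions, wikiprojects):
    """Build one observation; WikiProjects missing from the mapping are skipped."""
    categories = []
    for project in wikiprojects:
        for category in WIKIPROJECT_TO_MID_LEVEL.get(project, []):
            if category not in categories:  # de-duplicate, keep first-seen order
                categories.append(category)
    return {
        "page_id": page_id,
        "rev_id": last_revision_within(revisions),
        "wp_categories": categories,
    }


# Usage with made-up timestamps (not real revision history).
created = datetime(2001, 1, 1, 12, 0)
example_revisions = [(234036, created), (999999, created + timedelta(hours=5))]
print(json.dumps(build_record(736, example_revisions, ["Biography", "Physics"])))
```

Unmapped WikiProjects are dropped silently here; whether they should instead be logged or flagged is an open design choice.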
Part 2 - Pages associated with categories
- This involves fetching a dataset of pages associated with each mid-level category.
- Since each mid-level category is quite generic, it'll be associated with lots of WikiProjects (and therefore lots of pages), so we need to think about how to trim that.
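One possible way to approach this, sketched under assumptions: invert the Part 1 records into a category-to-pages index, then cap each category with a seeded random sample. The trimming strategy is still undecided, so the cap, the `max_pages` parameter, and both function names are illustrative rather than a chosen design.

```python
import random
from collections import defaultdict


def pages_by_category(records):
    """Invert Part 1 records into {mid-level category: [page_id, ...]}."""
    index = defaultdict(list)
    for record in records:
        for category in record["wp_categories"]:
            index[category].append(record["page_id"])
    return dict(index)


def trim(index, max_pages, seed=0):
    """Cap each category at `max_pages` pages using a seeded random sample.

    ASSUMPTION: uniform random sampling is only one candidate strategy.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    trimmed = {}
    for category, page_ids in index.items():
        if len(page_ids) > max_pages:
            trimmed[category] = sorted(rng.sample(page_ids, max_pages))
        else:
            trimmed[category] = sorted(page_ids)
    return trimmed
```

A uniform sample keeps the sketch simple, but it may under-represent pages from smaller WikiProjects inside a large category; stratifying the sample by WikiProject would be an alternative worth comparing.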