
Build mid-level WikiProject category training set
Closed, Resolved (Public)

Description

This dataset will have a set of mid-level WikiProject categories associated with pages.

Part 1 - Mid-level categories

E.g. Albert Einstein is tagged by lots of WikiProjects (Biography, Germany, Switzerland, History of Science, Jewish history, United States, Socialism, Mathematics, New Jersey, Philosophy, Physics, Astronomy, Religion). These WikiProjects fall into the following mid-level categories (based on the WikiProject Directory):

  • culture.biography
  • geography.countries
  • geography.americas
  • geography.europe
  • history_and_society.politics_and_government
  • stem.mathematics
  • stem.physics
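
A minimal sketch of that lookup, assuming a hand-written mapping from WikiProject names to mid-level categories. The `WIKIPROJECT_TO_MID_LEVEL` table and the `mid_level_categories` helper are hypothetical names, and the individual entries are illustrative guesses rather than the actual WikiProject Directory assignments; WikiProjects missing from the table are simply skipped.

```python
# Illustrative mapping only (an assumption for this sketch); the real mapping
# would be derived from the WikiProject Directory.
WIKIPROJECT_TO_MID_LEVEL = {
    "Biography": ["culture.biography"],
    "Germany": ["geography.countries", "geography.europe"],
    "Switzerland": ["geography.countries", "geography.europe"],
    "United States": ["geography.countries", "geography.americas"],
    "New Jersey": ["geography.americas"],
    "Socialism": ["history_and_society.politics_and_government"],
    "Mathematics": ["stem.mathematics"],
    "Physics": ["stem.physics"],
}


def mid_level_categories(wikiprojects):
    """Return the sorted, de-duplicated mid-level categories for one page."""
    categories = set()
    for project in wikiprojects:
        # Unknown WikiProjects contribute nothing instead of raising.
        categories.update(WIKIPROJECT_TO_MID_LEVEL.get(project, []))
    return sorted(categories)


einstein_projects = [
    "Biography", "Germany", "Switzerland", "United States",
    "Socialism", "Mathematics", "New Jersey", "Physics",
]
for category in mid_level_categories(einstein_projects):
    print(category)
```

Using a set keeps the result free of duplicates when several WikiProjects (e.g. Germany and Switzerland) map to the same mid-level category.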

We'll want to be able to classify articles about an hour after creation, so each record should probably also include the last revision saved within that first hour.

So a record for this observation would look like this:

{
  "page_id": 736,
  "rev_id": 234036,
  "wp_categories": [
    "culture.biography",
    "geography.countries",
    "geography.americas",
    "geography.europe",
    "history_and_society.politics_and_government",
    "stem.mathematics",
    "stem.physics"
  ]
}
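
One way such a record could be assembled is sketched below. It assumes the page's revisions are already available as `(rev_id, timestamp)` pairs; `last_revision_within` and `build_record` are hypothetical helpers, and the revision IDs and timestamps in the example are made up for illustration.

```python
import json
from datetime import datetime, timedelta


def last_revision_within(revisions, window=timedelta(hours=1)):
    """Return the rev_id of the last revision saved within `window` of creation.

    `revisions` is a non-empty iterable of (rev_id, timestamp) pairs; the
    earliest timestamp is treated as the page creation time.
    """
    ordered = sorted(revisions, key=lambda rev: rev[1])
    if not ordered:
        raise ValueError("page has no revisions")
    created = ordered[0][1]
    eligible = [rev for rev in ordered if rev[1] - created <= window]
    return eligible[-1][0]


def build_record(page_id, revisions, wp_categories):
    """Assemble one observation in the record format shown above."""
    return {
        "page_id": page_id,
        "rev_id": last_revision_within(revisions),
        "wp_categories": sorted(wp_categories),
    }


# Made-up revision history for illustration only.
example_revisions = [
    (1001, datetime(2017, 8, 2, 12, 0)),
    (1002, datetime(2017, 8, 2, 12, 20)),
    (1003, datetime(2017, 8, 2, 12, 55)),
    (1004, datetime(2017, 8, 2, 14, 30)),  # outside the one-hour window
]
record = build_record(42, example_revisions, ["stem.physics", "culture.biography"])
print(json.dumps(record, indent=2))
```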

Part 2 - Pages associated with categories

  • This involves fetching a dataset of pages associated with each mid-level WikiProject category.
  • Since each mid-level category is quite generic, it will be associated with lots of WikiProjects, so we need to think about how to trim that.
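
The task does not say how the trimming should work. As one possible reading, the sketch below inverts Part 1 style records into a category-to-pages index and then caps each category at a fixed number of randomly sampled pages. The helper names, the cap, and the seed are assumptions for illustration, not the approach that was actually merged.

```python
import random
from collections import defaultdict


def pages_by_category(records):
    """Invert records into {category: set of page_ids}."""
    index = defaultdict(set)
    for record in records:
        for category in record["wp_categories"]:
            index[category].add(record["page_id"])
    return index


def trim(index, max_pages, seed=0):
    """Cap each category at `max_pages` page IDs via seeded random sampling."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    trimmed = {}
    for category, page_ids in index.items():
        ordered = sorted(page_ids)
        if len(ordered) > max_pages:
            ordered = sorted(rng.sample(ordered, max_pages))
        trimmed[category] = ordered
    return trimmed


# Made-up records for illustration only.
example_records = [
    {"page_id": 1, "wp_categories": ["stem.physics", "culture.biography"]},
    {"page_id": 2, "wp_categories": ["stem.physics"]},
    {"page_id": 3, "wp_categories": ["stem.physics"]},
]
trimmed_index = trim(pages_by_category(example_records), max_pages=2)
for category, page_ids in sorted(trimmed_index.items()):
    print(category, page_ids)
```

A per-category cap keeps very broad categories from dominating the training set, at the cost of discarding some labeled pages.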

Event Timeline

Halfak created this task. Aug 2 2017, 7:53 PM
Halfak updated the task description. Aug 2 2017, 7:56 PM
Sumit updated the task description. Sep 25 2017, 2:59 PM
Halfak assigned this task to Sumit. Nov 20 2017, 5:52 PM
Halfak moved this task from Review to Done on the Scoring-platform-team (Current) board.
Halfak moved this task from Done to Review on the Scoring-platform-team (Current) board.

Reviewed and merged! Let's get a dataset uploaded to figshare and then call this "done"!

Halfak closed this task as Resolved. Jan 30 2018, 8:32 PM