Improve drafttopic training data pipeline
Closed, Resolved (Public)

Description

Per discussions, a few changes we want to make to the drafttopic training data pipeline:

  • Improve the WikiProject directory (currently via the processing code, but perhaps at a later point through community-agreed-upon edits)
    • For example, WikiProject Brands does not appear in the directory but should probably be under History_And_Society.Business and economics. WikiProject Women Scientists is only included in Culture.Language and literature because that is where its main listing is, but (I believe) it should be under STEM.Science as well.
  • Build a pipeline that brings together the following information in a single file for every article in English Wikipedia with at least one WikiProject template (currently 5.7M articles):
    • Article metadata: talk page ID, talk page revision, article title, article page ID, article revision ID, Wikidata ID (QID)
    • WikiProject templates: list of WikiProject-related templates from an article's talk page
    • Mid-level category tags: based on WikiProject templates and the mapping from the WikiProject directory

Example output JSON:

{
  "title": "Atlas Shrugged",
  "talk_pid": 128,
  "talk_revid": 911346471,
  "article_revid": 918538850,
  "article_pid": 18951386,
  "qid": "Q374098",
  "sitelinks": {
      "ro": "Revolta lui Atlas",
      "ja": "\u80a9\u3092\u3059\u304f\u3081\u308b\u30a2\u30c8\u30e9\u30b9",
      "is": "Undirsta\u00f0an",
      "ru": "\u0410\u0442\u043b\u0430\u043d\u0442 \u0440\u0430\u0441\u043f\u0440\u0430\u0432\u0438\u043b \u043f\u043b\u0435\u0447\u0438",
      "zh": "\u963f\u7279\u62c9\u65af\u8073\u8073\u80a9",
      "fi": "Kun maailma j\u00e4rkkyi",
      "la": "Atlas Shrugged",
      "pt": "Atlas Shrugged",
      "de": "Atlas wirft die Welt ab",
      "no": "De som beveger verden",
      "vi": "Atlas Shrugged",
      "he": "\u05de\u05e8\u05d3 \u05d4\u05e0\u05e4\u05d9\u05dc\u05d9\u05dd",
      "fr": "La Gr\u00e8ve (roman d'Ayn Rand)",
      "nl": "Atlas Shrugged",
      "id": "Atlas Shrugged",
      "io": "La rebeleso di Atlas",
      "sq": "Revolta e Atlasit",
      "pl": "Atlas zbuntowany",
      "eo": "Atlas Shrugged",
      "da": "Og verden sk\u00e6lvede",
      "sv": "Och v\u00e4rlden sk\u00e4lvde",
      "es": "La rebeli\u00f3n de Atlas",
      "it": "La rivolta di Atlante",
      "bg": "\u0410\u0442\u043b\u0430\u0441 \u0438\u0437\u043f\u0440\u0430\u0432\u0438 \u0440\u0430\u043c\u0435\u043d\u0435",
      "ky": "\u0410\u0442\u043b\u0430\u043d\u0442 \u0438\u0439\u0438\u043d\u0434\u0435\u0440\u0438\u043d \u043a\u0443\u0443\u0448\u0443\u0440\u0434\u0443",
      "fa": "\u0627\u0637\u0644\u0633 \u0634\u0648\u0631\u06cc\u062f",
      "simple": "Atlas Shrugged",
      "ca": "La rebel\u00b7li\u00f3 d'Atles",
      "hy": "\u0531\u057f\u056c\u0561\u0576\u057f\u0568 \u057a\u0561\u0580\u0566\u0565\u0581 \u0569\u0587\u0565\u0580\u0568",
      "cs": "Atlasova vzpoura",
      "en": "Atlas Shrugged",
      "af": "Atlas Shrugged",
      "ko": "\uc544\ud2c0\ub77c\uc2a4 (\uc18c\uc124)",
      "ar": "\u062d\u064a\u0646\u0645\u0627 \u0647\u0632 \u0623\u0637\u0644\u0633 \u0643\u062a\u0641\u064a\u0647",
      "uk": "\u0410\u0442\u043b\u0430\u043d\u0442 \u0440\u043e\u0437\u043f\u0440\u0430\u0432\u0438\u0432 \u043f\u043b\u0435\u0447\u0456"
  },
  "wp_templates": [
      "wikiproject objectivism",
      "wikiproject novels",
      "wikiproject philosophy",
      "wikiproject libertarianism",
      "wikiproject politics",
      "wikiproject trains"
  ],
  "mid_level_categories": [
      "Culture.Philosophy and religion",
      "Culture.Language and literature",
      "History_And_Society.Politics and government",
      "History_And_Society.Transportation"
  ]
}
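
The mid-level category tags in the example above can be produced from the template list with a plain lookup. This is a minimal sketch, not the pipeline's actual code: `DIRECTORY` and `mid_level_categories` are hypothetical names, and the few mapping entries are illustrative guesses based on the example output (the real mapping is derived from the WikiProject directory).

```python
# Hypothetical excerpt of a template -> mid-level label mapping. The real
# mapping comes from parsing the WikiProject directory; these entries are
# illustrative only.
DIRECTORY = {
    "wikiproject novels": ["Culture.Language and literature"],
    "wikiproject philosophy": ["Culture.Philosophy and religion"],
    "wikiproject politics": ["History_And_Society.Politics and government"],
    "wikiproject trains": ["History_And_Society.Transportation"],
}


def mid_level_categories(wp_templates, directory=DIRECTORY):
    """Return de-duplicated labels for the given templates, in first-seen order."""
    labels = []
    for template in wp_templates:
        # Templates missing from the directory simply contribute no label.
        for label in directory.get(template.strip().lower(), []):
            if label not in labels:
                labels.append(label)
    return labels


templates = [
    "wikiproject novels",
    "wikiproject philosophy",
    "wikiproject politics",
    "wikiproject trains",
    "wikiproject not in directory",
]
print(mid_level_categories(templates))
```

Templates that are missing from the directory (such as WikiProject Brands, mentioned in the description) are silently skipped here, which is why gaps in the directory translate directly into missing labels.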

Event Timeline

@Halfak The current output (bzipped JSON) is on stat1007 at /home/isaacj/drafttopic/full_wptemplates.json.bz2

Use bzless on the command line or basic Python 3 code to access it:

import bz2
import json

# Each line of the decompressed file is parsed as one JSON object (one article).
with bz2.open('/home/isaacj/drafttopic/full_wptemplates.json.bz2', 'rt') as fin:
    for line in fin:
        article_json = json.loads(line)
        ...
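
As one example of what the loop body might do, here is a small sketch that tallies how many articles carry each mid-level category. The function name `count_labels` and the two sample records are made up for illustration; only the `mid_level_categories` field name comes from the example output in the description.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count how many articles carry each mid-level category."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        article = json.loads(line)
        # set() so a label repeated within one article is counted once
        counts.update(set(article.get("mid_level_categories", [])))
    return counts


# Made-up sample records standing in for lines of the real file.
sample = [
    json.dumps({"title": "A", "mid_level_categories": ["STEM.Science"]}),
    json.dumps({"title": "B", "mid_level_categories": ["STEM.Science", "Culture.Arts"]}),
]
counts = count_labels(sample)
print(counts.most_common())
```

Because `count_labels` accepts any iterable of lines, the `fin` handle returned by `bz2.open(..., 'rt')` can be passed to it directly.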

The script for generating this file is on stat1007 at /home/isaacj/drafttopic/build_dataset.py. I still have some cleaning and commenting to do on it, but I plan to create a figshare item with the dataset (based on the 2019-10-01 dumps) and the script unless you think it should be part of the drafttopic code directory. I'll also be formalizing the changes that I made to the drafttopic pipeline so I can make a pull request against https://github.com/wikimedia/drafttopic

Fantastic. I'll dig into this.

@MMiller_WMF and his team have been looking into improvements to make to the directory structure as well.

I did a bit of work trying to clean up the directory. It's messier than I expected. Some categories don't make any sense. In other cases, a lot is missing from a category.

See my work here: https://etherpad.wikimedia.org/p/wikiproject_directory_topic_labels

I think the best way to move forward is to publish a cleaned-up version of this directory and to version it for others' reuse. I spent an hour on this and I'm about 20% done, so I don't think the cleanup would take too much work.

@Halfak I didn't want to mess with the work you had done and in some cases didn't know how to merge my suggestions, so I just added them to the etherpad after your section. For posterity, here's what I had come up with from my own work. In general I think they complement your suggestions, but we might need some meeting time as a group at some point to merge everything.

What I added to the etherpad:

These are just my personal opinions / observations / experimentation based on working with this taxonomy and I'm happy to talk through any of the suggestions.

Changes that should be made directly to the WikiProjects Directory:

  • Remove the link labeled "Cities of the United States" that is here ( https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Geographical#Cities_of_the_United_States ). The section it links to is currently missing and, as a result, our parsing of the directory associates all North/South American cities/countries with the label "Cities of the United States". It can be fixed in the code too, but I don't think the link makes sense from a reader's perspective either.
  • Last I checked, the following WikiProjects appear to be missing (entirely or from additional sections they should be under) from the directory. I also provide my suggested place in the directory:
    • WikiProject Awards and Prizes: History and Society -- History and Society
    • WikiProject Historic Sites: Geography -- Parks, conservation areas, and historical sites
    • WikiProject Brands: History and Society -- Business and economics
    • WikiProject Marketing & Advertising: History and Society -- Business and economics
    • WikiProject Women Writers: Culture -- Biography
    • WikiProject Fictional Characters: Culture -- Literature
    • WikiProject Hugo and Nebula Award Winners: Culture -- Literature
    • WikiProject Musicians: Culture -- Music
    • WikiProject Women artists: Culture -- Arts
    • WikiProject Women scientists: STEM -- Science
    • I would argue that the following three WikiProjects should be moved from STEM.Biology to History and Society.History and society:
      • WikiProject Sexology and sexuality
      • WikiProject Gender Studies
      • WikiProject LGBT studies
    • I also support breaking out Language and Literature into People and Language/Literature. I'd put at least the following WikiProjects under Biography:
      • WikiProject Biography
      • WikiProject Living People
      • WikiProject Royalty and Nobility
      • WikiProject Crime and Criminal Biography
      • WikiProject English Numeral Royalty Redirect
      • WikiProject Genealogy
      • WikiProject Hugo and Nebula Award Winners
      • WikiProject Leaders by year
      • WikiProject Musicians
      • WikiProject Peerage and Baronetage
      • WikiProject Persondata
      • WikiProject Royalty and Nobility
      • WikiProject Saints
      • WikiProject Women artists
      • WikiProject Women scientists

Changes that should be made to our derived taxonomy, but can reasonably be done via code:

  • I think we should toss out Geography.Countries and Geography.Cities -- all of the WikiProjects under these sections are currently associated with their respective regions (Americas, Asia, etc.) and the few WikiProjects whose data would be lost (WikiProject Cities, WikiProject Ghost Towns, WikiProject British Overseas Territories, WikiProject Countries, WikiProject Commonwealth) tend to duplicate the regions as well.
    • I think that this has the side effect, though, of removing WikiProject Antarctica, which should be mapped to Geography.Antarctica or a more generic Geography topic
  • I want to toss out all of the Assistance topics -- they aren't cohesive, they depend heavily on whether the label is out of date, and they don't apply to the vast majority of use cases.
  • I think we should remove WikiProject Commonwealth from the directory. It adds a ton of noise because the Brits colonized pretty much everywhere and thus countries in pretty much every continent get smeared together as opposed to associated with their respective Geography.<Continent> region.
  • All WikiProjects in Culture.Music also appear in Culture.Performing arts -- I'd remove them from Culture.Performing arts so as to avoid redundancy
  • All WikiProjects in STEM.Engineering also appear in STEM.Technology -- I'd remove them from STEM.Technology so as to avoid redundancy
  • All WikiProjects in Culture.Games and toys also appear in Culture.Entertainment -- I'd remove them from Culture.Entertainment so as to avoid redundancy
  • I'd associate WikiProject Museums with Culture.Arts instead of Culture.Plastic Arts (which tends to refer specifically to architecture)
  • Name changes:
    • Geography.Parks, conservation areas, and historical sites -> Geography.Parks
    • Culture.Plastic arts -> Culture.Architecture
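
Several of the code-level changes above can be sketched as one pass over the derived taxonomy. This is only an illustration under assumptions: the taxonomy is modeled as a dict from label to a set of WikiProject names, Assistance labels are assumed to start with "Assistance.", and `clean_taxonomy` plus the sample WikiProject names are hypothetical. The Antarctica and Museums remappings are left out.

```python
def clean_taxonomy(taxonomy):
    """Apply the proposed clean-ups to a {label: set_of_wikiprojects} dict."""
    # Drop the labels that duplicate the regional Geography labels, plus
    # every Assistance topic (assumed prefix: "Assistance.").
    dropped = {"Geography.Countries", "Geography.Cities"}
    cleaned = {
        label: set(projects)
        for label, projects in taxonomy.items()
        if label not in dropped and not label.startswith("Assistance.")
    }

    # WikiProject Commonwealth is removed everywhere.
    for projects in cleaned.values():
        projects.discard("WikiProject Commonwealth")

    # Where a narrow label is fully duplicated in a broader one, remove the
    # narrow label's WikiProjects from the broader label.
    for narrow, broad in [
        ("Culture.Music", "Culture.Performing arts"),
        ("STEM.Engineering", "STEM.Technology"),
        ("Culture.Games and toys", "Culture.Entertainment"),
    ]:
        if narrow in cleaned and broad in cleaned:
            cleaned[broad] -= cleaned[narrow]

    # Apply the two name changes.
    renames = {
        "Geography.Parks, conservation areas, and historical sites": "Geography.Parks",
        "Culture.Plastic arts": "Culture.Architecture",
    }
    return {renames.get(label, label): projects for label, projects in cleaned.items()}


# Placeholder data; the WikiProject names are for illustration only.
taxonomy = {
    "Geography.Cities": {"WikiProject Cities"},
    "Assistance.Maintenance": {"WikiProject Cleanup"},
    "Culture.Music": {"WikiProject Jazz"},
    "Culture.Performing arts": {"WikiProject Jazz", "WikiProject Theatre"},
    "Culture.Plastic arts": {"WikiProject Architecture"},
}
cleaned = clean_taxonomy(taxonomy)
print(cleaned)
```

Keeping these rules in code rather than editing the directory means they can be versioned and reverted independently of any community-agreed changes to the directory itself.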

Things about which I am currently undecided but want to flag :)

@Halfak -- I've finished going through each topic using the Growth team's prototype and I've added detailed notes about them to the etherpad at the bottom. My notes contain impressions about how each topic is or is not performing well and example articles to support that. I also included a link to a spreadsheet showing my attempt at a mapping between the current ORES topics and the topics that the Growth team thinks generally cover the spaces of interest that users have. It shows where certain topics have too much or too little detail in ORES. Please let me know what you think or if you have any questions or reactions.

@Halfak -- I wanted to flag for you that @RHo has completed designs that we want to use to implement topic matching in newcomer tasks: https://wikimedia.invisionapp.com/share/S3UV941DUWM#/screens/393723066 (you can navigate with your arrow keys).

Screens 3, 4, and 5 show what we intend to implement with the first version. But on screen 6, we are showing a design that utilizes multiple levels of topics. We think that users would have a good experience with two levels instead of one, and so I wanted to bring that up here as you and @Isaac work on the training. I know that the existing drafttopic model has levels, but they are probably too few and too broad to be useful: STEM, Geography, Culture, History and Society. I'm not sure what a better set of top level groupings would be, but I wanted to let you know that if we do end up with a good set, we have designs in which we would want to use them. Otherwise we'll want to use something like 20 to 30 topics just on one level (as shown in screens 3, 4, and 5).

Please let us (or @marcella / @Catrope) know if you have any thoughts or questions.

I think having multiple levels makes a lot of sense and I'm interested in what a proposal for a better top-level grouping would be.

I just shared a gdoc with @MMiller_WMF and @Isaac with a completed taxonomy. Once they have had a look at my work, I'll convert this to something machine readable and we can start experimenting with re-training the model.

Thanks @Halfak, this is awesome! I left a bunch of comments with the goal of trimming it down

I responded to a bunch of notes and I moved the data to a machine-readable format. See https://github.com/halfak/wikitax

I sent notes to @Halfak and @Isaac via email and in a spreadsheet. They sent notes back that I need to respond to.

Some follow-up to a conversation with @Halfak and @dr0ptp4kt :

This is how I have been adjusting the model outputs based on Wikidata properties in my Wikidata-based topic model, but the same adjustments would apply to ORES:

  • I create an additional topic "Compilation.List_Disambig" that includes List / Disambiguation articles. This is based on either the instance-of property (P31:Q4167410 for disambiguation pages; P31:Q13406463 for lists) or presence of the "is a list of" property (P360, which largely duplicates P31:Q13406463 but is kept for completeness).
  • If P625 (coordinate location) does not exist for a Wikidata item, I lower the output confidence of any geography prediction by 0.5 (effectively removing it if the threshold is 0.5). How to handle articles that are tagged with Geography WikiProjects but that most people don't think of as Geography (e.g., famous people being tagged by the WikiProject for the state they were born in) is still a big open question.
  • I take any item with instance-of human (P31:Q5), move it into a Person topic, and downgrade Culture.Language and Literature. This will no longer be necessary given the in-progress changes to the taxonomy!

The actual code that I use is here: https://github.com/geohci/wikidata-topic-model/blob/master/app/app.py#L30
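
For readers who don't want to open the repository, here is a simplified sketch of the first two adjustments. It is not the code at the link: `adjust_predictions`, the `{property_id: [values]}` shape of `claims`, and the score of 1.0 assigned to the new topic are all assumptions made for illustration. The Person adjustment is omitted because the taxonomy changes make it unnecessary.

```python
DISAMBIGUATION = "Q4167410"  # instance-of value for disambiguation pages
LIST = "Q13406463"           # instance-of value for list articles


def adjust_predictions(predictions, claims, geo_penalty=0.5):
    """Return a copy of {topic: score} adjusted using Wikidata claims.

    `claims` is assumed to look like {"P31": ["Q5"], "P625": [...]}.
    """
    adjusted = dict(predictions)
    instance_of = set(claims.get("P31", []))

    # Lists and disambiguation pages get their own topic (score is assumed).
    if instance_of & {DISAMBIGUATION, LIST} or "P360" in claims:
        adjusted["Compilation.List_Disambig"] = 1.0

    # Without a coordinate location, every geography prediction is lowered.
    if "P625" not in claims:
        for topic in adjusted:
            if topic.startswith("Geography."):
                adjusted[topic] = max(0.0, adjusted[topic] - geo_penalty)

    return adjusted


scores = {"Geography.Europe": 0.8, "Culture.Music": 0.6}
person = adjust_predictions(scores, {"P31": ["Q5"]})
list_page = adjust_predictions(scores, {"P31": [LIST], "P625": ["placeholder"]})
print(person)
print(list_page)
```

With a decision threshold of 0.5, the geography label on the item without coordinates falls below the cut-off, which matches the "effectively removing" behavior described above.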

On a related topic, if we're considering outputting Wikidata properties as part of an ORES prediction, I'd argue for at least the following:

  • Instance-of (P31) to cover the List/Disambiguation use-case and provide further data to help evaluate whether an article is about a person or not
  • Occupation (P106) to help further break down the Biography topic
  • Gender (P21) to help with filtering predictions around women scientists etc. where false positives can be particularly problematic
  • Coordinate location (P625) to help with evaluating geography predictions.
  • We could potentially consider an aggregator for country too but its design is not obvious -- e.g., outputting any countries listed under country (P17) for non-people, country-of-citizenship (P27) for people, or something more complex with place of birth (P19) or place of death (P20) for people as well.
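
To make the simplest of those country options concrete, here is a sketch that reads country-of-citizenship (P27) for people and country (P17) for everything else. The function name `countries_for_item` and the `{property_id: [values]}` shape of `claims` are assumptions, and this is only one possible design; it ignores place of birth and place of death entirely.

```python
def countries_for_item(claims):
    """Return sorted country QIDs: P27 for humans, P17 for everything else."""
    is_human = "Q5" in claims.get("P31", [])
    prop = "P27" if is_human else "P17"
    return sorted(set(claims.get(prop, [])))


# Q30 is the United States and Q145 is the United Kingdom.
person = {"P31": ["Q5"], "P27": ["Q30", "Q145"], "P17": ["Q30"]}
place = {"P31": ["Q515"], "P17": ["Q30"]}
print(countries_for_item(person))
print(countries_for_item(place))
```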

@Halfak : here are the Python regexes that my NYU masters students developed for English/Hindi/Russian, which might assist in preprocessing XML dump wikitext into model-ready tokens: https://github.com/mmarinated/topic-modeling/blob/master/baseline/data_creation/wiki_parser.py

They started with that wikifil.pl script as well for motivation and adapted it.

@Isaac looks like this is a straggler task so I'm resolving. Let me know if I'm missing something.

@Halfak Sounds good to me. I almost resolved it a while back but decided that I still wanted to incorporate some of Morten's suggestions into our process for grabbing the list of articles and WikiProject templates. I'll open a new task to do that work though.