Improve drafttopic training data pipeline
Closed, Resolved, Public

Assigned To

Authored By

	Isaac
	Oct 28 2019, 5:54 PM

Description

Per discussions, a few changes we want to make to the drafttopic training data pipeline:

Improve WikiProject directory (currently via the processing code but perhaps at later point through community-agreed-upon edits)
For example, WikiProject Brands does not appear in the directory but should probably be under History_And_Society.Business and economics. WikiProject Women Scientists is only included in Culture.Language and literature because that is where its main listing is, but (I believe) it should be under STEM.Science as well.

Pipeline for bringing together the following information in a single file for every article in English Wikipedia with at least one WikiProject template (currently 5.7M articles):
Article metadata: talk page ID, talk page revision, article title, article page ID, article revision ID, Wikidata ID (QID)
WikiProject templates: list of WikiProject-related templates from an article's talk page
Mid-level category tags: based on WikiProject templates and mapping from WikiProject directory

Example output JSON:

{
  "title": "Atlas Shrugged",
  "talk_pid": 128,
  "talk_revid": 911346471,
  "article_revid": 918538850,
  "article_pid": 18951386,
  "qid": "Q374098",
  "sitelinks": {
      "ro": "Revolta lui Atlas",
      "ja": "\u80a9\u3092\u3059\u304f\u3081\u308b\u30a2\u30c8\u30e9\u30b9",
      "is": "Undirsta\u00f0an",
      "ru": "\u0410\u0442\u043b\u0430\u043d\u0442 \u0440\u0430\u0441\u043f\u0440\u0430\u0432\u0438\u043b \u043f\u043b\u0435\u0447\u0438",
      "zh": "\u963f\u7279\u62c9\u65af\u8073\u8073\u80a9",
      "fi": "Kun maailma j\u00e4rkkyi",
      "la": "Atlas Shrugged",
      "pt": "Atlas Shrugged",
      "de": "Atlas wirft die Welt ab",
      "no": "De som beveger verden",
      "vi": "Atlas Shrugged",
      "he": "\u05de\u05e8\u05d3 \u05d4\u05e0\u05e4\u05d9\u05dc\u05d9\u05dd",
      "fr": "La Gr\u00e8ve (roman d'Ayn Rand)",
      "nl": "Atlas Shrugged",
      "id": "Atlas Shrugged",
      "io": "La rebeleso di Atlas",
      "sq": "Revolta e Atlasit",
      "pl": "Atlas zbuntowany",
      "eo": "Atlas Shrugged",
      "da": "Og verden sk\u00e6lvede",
      "sv": "Och v\u00e4rlden sk\u00e4lvde",
      "es": "La rebeli\u00f3n de Atlas",
      "it": "La rivolta di Atlante",
      "bg": "\u0410\u0442\u043b\u0430\u0441 \u0438\u0437\u043f\u0440\u0430\u0432\u0438 \u0440\u0430\u043c\u0435\u043d\u0435",
      "ky": "\u0410\u0442\u043b\u0430\u043d\u0442 \u0438\u0439\u0438\u043d\u0434\u0435\u0440\u0438\u043d \u043a\u0443\u0443\u0448\u0443\u0440\u0434\u0443",
      "fa": "\u0627\u0637\u0644\u0633 \u0634\u0648\u0631\u06cc\u062f",
      "simple": "Atlas Shrugged",
      "ca": "La rebel\u00b7li\u00f3 d'Atles",
      "hy": "\u0531\u057f\u056c\u0561\u0576\u057f\u0568 \u057a\u0561\u0580\u0566\u0565\u0581 \u0569\u0587\u0565\u0580\u0568",
      "cs": "Atlasova vzpoura",
      "en": "Atlas Shrugged",
      "af": "Atlas Shrugged",
      "ko": "\uc544\ud2c0\ub77c\uc2a4 (\uc18c\uc124)",
      "ar": "\u062d\u064a\u0646\u0645\u0627 \u0647\u0632 \u0623\u0637\u0644\u0633 \u0643\u062a\u0641\u064a\u0647",
      "uk": "\u0410\u0442\u043b\u0430\u043d\u0442 \u0440\u043e\u0437\u043f\u0440\u0430\u0432\u0438\u0432 \u043f\u043b\u0435\u0447\u0456"
  },
  "wp_templates": [
      "wikiproject objectivism",
      "wikiproject novels",
      "wikiproject philosophy",
      "wikiproject libertarianism",
      "wikiproject politics",
      "wikiproject trains"
  ],
  "mid_level_categories": [
      "Culture.Philosophy and religion",
      "Culture.Language and literature",
      "History_And_Society.Politics and government",
      "History_And_Society.Transportation"
  ]
}
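
The mid-level category step can be sketched as a lookup from each template name to its directory section(s). This is only an illustration: the mapping below is a hypothetical excerpt chosen to reproduce the example record, and the function name is an assumption. The real mapping comes from parsing the WikiProject directory.

```python
# Hypothetical excerpt of a template -> mid-level category mapping.
# The real mapping is derived from the WikiProject directory; these
# entries are assumptions chosen to match the example record above.
TEMPLATE_TO_CATEGORIES = {
    "wikiproject objectivism": ["Culture.Philosophy and religion"],
    "wikiproject novels": ["Culture.Language and literature"],
    "wikiproject philosophy": ["Culture.Philosophy and religion"],
    "wikiproject libertarianism": ["History_And_Society.Politics and government"],
    "wikiproject politics": ["History_And_Society.Politics and government"],
    "wikiproject trains": ["History_And_Society.Transportation"],
}


def mid_level_categories(wp_templates, mapping=TEMPLATE_TO_CATEGORIES):
    """Return de-duplicated categories for a list of template names,
    keeping first-seen order. Templates missing from the mapping are skipped."""
    seen = {}
    for template in wp_templates:
        for category in mapping.get(template.strip().lower(), []):
            seen.setdefault(category, None)  # dicts preserve insertion order
    return list(seen)


templates = [
    "wikiproject objectivism",
    "wikiproject novels",
    "wikiproject philosophy",
    "wikiproject libertarianism",
    "wikiproject politics",
    "wikiproject trains",
]
print(mid_level_categories(templates))
```

Skipping unmapped templates silently is a design choice for the sketch; a real pipeline would probably want to log them, since missing directory entries are exactly what this task is trying to find.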

Related Objects

Status	Assigned	Task
Resolved	Halfak	T243451 Deploy ORES -- Late Jan 2020
Resolved	Halfak	T235181 Build WikiProject directory topic models for ar, cs, and kowiki
Resolved	Halfak	T235183 Experiment with different vector lengths for ar, cs, en, and kowiki topic models.
Resolved	Halfak	T235187 Create labeled data for topic models in ar, cs, kowiki
Resolved	• Rileych	T240517 [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics)
Resolved	Isaac	T236713 Improve drafttopic training data pipeline
Resolved	Isaac	T240273 Extract cross-wiki WikiProject tags
Resolved	Halfak	T240286 Re-train English Wikipedia topic model using new WikiProject Taxonomy
Resolved	Halfak	T240276 Restructure WikiProject directory to be better
Resolved	kevinbazira	T240282 Improve WikiProject template --> WikiProject mapping

Event Timeline

Isaac created this task. Oct 28 2019, 5:54 PM

@Halfak Current output bzipped JSON is on stat1007 at /home/isaacj/drafttopic/full_wptemplates.json.bz2

Use bzless on the command line or basic Python 3 code to access it:

import bz2
import json

# Each line of the bzipped file is one JSON object describing one article.
with bz2.open('/home/isaacj/drafttopic/full_wptemplates.json.bz2', 'rt') as fin:
    for line in fin:
        article_json = json.loads(line)
        ...

The script for generating this file is on stat1007 at /home/isaacj/drafttopic/build_dataset.py. I have some cleaning / commenting to do on the file, but I plan to create a figshare item with the dataset (based on 2019-10-01 dumps) and script unless you think it should be part of the drafttopic code directory. I'll also be working on formalizing the changes that I made to the drafttopic pipeline so I can make a pull request against https://github.com/wikimedia/drafttopic

Fantastic. I'll dig into this.

Halfak added a parent task: T235187: Create labeled data for topic models in ar, cs, kowiki. Oct 28 2019, 6:21 PM

Halfak mentioned this in T235187: Create labeled data for topic models in ar, cs, kowiki.

Isaac added subscribers: MGerlach, diego. Oct 30 2019, 3:35 PM

@MMiller_WMF and his team have been looking into improvements to make to the directory structure as well.

I did a bit of work trying to clean up the directory. The level of messiness is worse than I expected. Some categories don't make any sense. In other cases, there's a lot missing from some categories.

See my work here: https://etherpad.wikimedia.org/p/wikiproject_directory_topic_labels

I think the best way to move forward is to publish a cleaned-up version of this directory and to version it for others to reuse. I spent an hour on this and I'm about 20% done, so I don't think it would take too much work to finish the cleanup.

@Halfak I didn't want to mess with the work you had done and in some cases didn't know how to merge my suggestions, so I just added them to the etherpad after your section. For posterity, here's what I had come up with from my own work. In general I think they complement your suggestions, but we might need some meeting time as a group at some point to merge everything.

What I added to the etherpad:

These are just my personal opinions / observations / experimentation based on working with this taxonomy and I'm happy to talk through any of the suggestions.

Changes that should be made directly to the WikiProjects Directory:

Remove the link labeled "Cities of the United States" that is here ( https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Geographical#Cities_of_the_United_States ). The section it links to is currently missing and, as a result, our parsing of the directory associates all North/South American cities/countries with the label "Cities of the United States". It can be fixed in the code too, but I don't think the link makes sense from a reader perspective either.
Last I checked, the following WikiProjects appear to be missing (entirely or from additional sections they should be under) from the directory. I also provide my suggested place in the directory:
- WikiProject Awards and Prizes: History and Society -- History and Society
- WikiProject Historic Sites: Geography -- Parks, conservation areas, and historical sites
- WikiProject Brands: History and Society -- Business and economics
- WikiProject Marketing & Advertising: History and Society -- Business and economics
- WikiProject Women Writers: Culture -- Biography
- WikiProject Fictional Characters: Culture -- Literature
- WikiProject Hugo and Nebula Award Winners: Culture -- Literature
- WikiProject Musicians: Culture -- Music
- WikiProject Women artists: Culture -- Arts
- WikiProject Women scientists: STEM -- Science
- I would argue that the following three WikiProjects should be moved from STEM.Biology to History and Society.History and society:
  - WikiProject Sexology and sexuality
  - WikiProject Gender Studies
  - WikiProject LGBT studies
- I also support breaking out Language and Literature into People and Language/Literature. I'd put at least the following WikiProjects under Biography:
  - WikiProject Biography
  - WikiProject Living People
  - WikiProject Royalty and Nobility
  - WikiProject Crime and Criminal Biography
  - WikiProject English Numeral Royalty Redirect
  - WikiProject Genealogy
  - WikiProject Hugo and Nebula Award Winners
  - WikiProject Leaders by year
  - WikiProject Musicians
  - WikiProject Peerage and Baronetage
  - WikiProject Persondata
  - WikiProject Royalty and Nobility
  - WikiProject Saints
  - WikiProject Women artists
  - WikiProject Women scientists

The following WikiProjects appear in the lead section as opposed to under their own section and therefore do not have an associated Table of Contents section on the directory and are missed by our parsing:
- WikiProject Africa ( https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Geographical/Africa )
- WikiProject Europe ( https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Geographical/Europe )
- WikiProject Mediterranean ( https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Geographical/Europe )
- WikiProject European Microstates ( https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Geographical/Europe )
- WikiProject European Union ( https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Geographical/Europe )

Changes that should be made to our derived taxonomy, but can reasonably be done via code:

I think we should toss out Geography.Countries and Geography.Cities -- all of the WikiProjects under these sections are currently associated with their respective regions (Americas, Asia, etc.) and the few WikiProjects whose data would be lost (WikiProject Cities, WikiProject Ghost Towns, WikiProject British Overseas Territories, WikiProject Countries, WikiProject Commonwealth) tend to duplicate the regions as well.
- I think that this has the side effect, though, of removing WikiProject Antarctica, which should be mapped to Geography.Antarctica or a more generic Geography topic
I want to toss out all of the Assistance topics -- they aren't cohesive, do matter w/r/t whether the label is out-of-date or not, and don't apply for the vast majority of use cases.
I think we should remove WikiProject Commonwealth from the directory. It adds a ton of noise because the Brits colonized pretty much everywhere and thus countries in pretty much every continent get smeared together as opposed to associated with their respective Geography.<Continent> region.
All WikiProjects in Culture.Music also appear in Culture.Performing arts -- I'd remove them from Culture.Performing arts so as to avoid redundancy
All WikiProjects in STEM.Engineering also appear in STEM.Technology -- I'd remove them from STEM.Technology so as to avoid redundancy
All WikiProjects in Culture.Games and toys also appear in Culture.Entertainment -- I'd remove them from Culture.Entertainment so as to avoid redundancy
I'd associate WikiProject Museums with Culture.Arts instead of Culture.Plastic Arts (which tends to refer specifically to architecture)
Name changes:
- Geography.Parks, conservation areas, and historical sites -> Geography.Parks
- Culture.Plastic arts -> Culture.Architecture
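
A minimal sketch of how the code-side changes above could be applied to a parsed taxonomy, assuming it is held as a dict from topic name to a set of WikiProject names. The function and constant names are hypothetical, as is the assumption that the Assistance topics share an "Assistance." prefix. It covers the drops, the redundancy removals, and the renames; the WikiProject Museums move and the WikiProject Antarctica remapping are left out.

```python
# Topics and projects to drop entirely (from the suggestions above).
DROPPED_TOPICS = {"Geography.Countries", "Geography.Cities"}
DROPPED_PREFIXES = ("Assistance.",)  # assumed naming of the Assistance topics
DROPPED_PROJECTS = {"WikiProject Commonwealth"}

# (specific topic, broader topic): projects listed under the specific
# topic are removed from the broader one to avoid redundancy.
REDUNDANT_PAIRS = [
    ("Culture.Music", "Culture.Performing arts"),
    ("STEM.Engineering", "STEM.Technology"),
    ("Culture.Games and toys", "Culture.Entertainment"),
]

RENAMES = {
    "Geography.Parks, conservation areas, and historical sites": "Geography.Parks",
    "Culture.Plastic arts": "Culture.Architecture",
}


def clean_taxonomy(taxonomy):
    """Return a cleaned copy of a {topic: set of WikiProject names} dict."""
    cleaned = {}
    for topic, projects in taxonomy.items():
        if topic in DROPPED_TOPICS or topic.startswith(DROPPED_PREFIXES):
            continue
        cleaned[topic] = set(projects) - DROPPED_PROJECTS
    for specific, broader in REDUNDANT_PAIRS:
        if specific in cleaned and broader in cleaned:
            cleaned[broader] -= cleaned[specific]
    return {RENAMES.get(topic, topic): projects for topic, projects in cleaned.items()}


# Toy input for illustration only; not the real directory contents.
example = {
    "Culture.Music": {"WikiProject Jazz"},
    "Culture.Performing arts": {"WikiProject Jazz", "WikiProject Theatre"},
    "Geography.Countries": {"WikiProject Countries"},
    "Geography.Europe": {"WikiProject France", "WikiProject Commonwealth"},
    "Assistance.Maintenance": {"WikiProject Stub sorting"},
    "Culture.Plastic arts": {"WikiProject Architecture"},
}
cleaned = clean_taxonomy(example)
```

Keeping the rules as data rather than inline conditionals makes it easier to version the cleaned taxonomy and to review each change separately.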

Things about which I am currently undecided but want to flag :)

I'm pretty sure the parser is not picking up the WikiProjects that are directly under this section because of its level: https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Council/Directory/Geographical#Geography
- We can handle this in the code, but it's also a very weird topic that does not have much cohesiveness.
WikiProject Philippines also goes by Tambayan Philippines and somewhere this is getting lost in the parsing
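
One possible way to keep alternate project names from getting lost is to normalize every name through an alias table before matching against the directory. This is a sketch only: the table, its single entry, and the function name are hypothetical, and where exactly the name is dropped in the current parser is not known.

```python
# Hypothetical alias table: alternate project name -> canonical name.
# Keys and values are lowercase with single spaces.
PROJECT_ALIASES = {
    "tambayan philippines": "wikiproject philippines",
}


def canonical_project(name):
    """Lowercase a project name, treat underscores as spaces, collapse
    whitespace, then resolve known aliases to the canonical name."""
    key = " ".join(name.replace("_", " ").split()).lower()
    return PROJECT_ALIASES.get(key, key)


print(canonical_project("Tambayan_Philippines"))
```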

MMiller_WMF added a parent task: T227728: [EPIC] Growth: Newcomer tasks 1.0. Nov 6 2019, 12:18 AM

MMiller_WMF added a project: NewcomerTasks 1.1.

MMiller_WMF added subscribers: kostajh, Catrope, • marcella and 3 others. Nov 6 2019, 12:21 AM

@Halfak -- I've finished going through each topic using the Growth team's prototype and I've added detailed notes about them to the etherpad at the bottom. My notes contain impressions about how each topic is or is not performing well and example articles to support that. I also included a link to a spreadsheet showing my attempt at a mapping between the current ORES topics and the topics that the Growth team thinks generally cover the spaces of interest that users have. It shows where certain topics have too much or too little detail in ORES. Please let me know what you think or if you have any questions or reactions.

@Halfak -- I wanted to flag for you that @RHo has completed designs that we want to use to implement topic matching in newcomer tasks: https://wikimedia.invisionapp.com/share/S3UV941DUWM#/screens/393723066 (you can navigate with your arrow keys).

Screens 3, 4, and 5 show what we intend to implement with the first version. But on screen 6, we are showing a design that utilizes multiple levels of topics. We think that users would have a good experience with two levels instead of one, and so I wanted to bring that up here as you and @Isaac work on the training. I know that the existing drafttopic model has levels, but they are probably too few and too broad to be useful: STEM, Geography, Culture, History and Society. I'm not sure what a better set of top level groupings would be, but I wanted to let you know that if we do end up with a good set, we have designs in which we would want to use them. Otherwise we'll want to use something like 20 to 30 topics just on one level (as shown in screens 3, 4, and 5).

Please let us (or @marcella / @Catrope) know if you have any thoughts or questions.

I think having multiple levels makes a lot of sense and I'm interested in what a proposal for a better top-level grouping would be.

MMiller_WMF edited parent tasks, added: T238608: [EPIC] Growth: Newcomer tasks 1.1.0 (topic matching); removed: T227728: [EPIC] Growth: Newcomer tasks 1.0. Nov 19 2019, 1:10 AM

MMiller_WMF mentioned this in T238610: Newcomer tasks: include topics in intro overlay. Nov 19 2019, 1:42 AM

I just shared a gdoc with @MMiller_WMF and @Isaac with a completed taxonomy. Once they have had a look at my work, I'll convert this to something machine readable and we can start experimenting with re-training the model.

Thanks @Halfak, this is awesome! I left a bunch of comments with the goal of trimming it down

I responded to a bunch of notes and I moved the data to a machine-readable format. See https://github.com/halfak/wikitax

Halfak mentioned this in T240273: Extract cross-wiki WikiProject tags. Dec 9 2019, 9:33 PM

I sent notes to @Halfak and @Isaac via email and in a spreadsheet. They sent notes back that I need to respond to.

Halfak added a subscriber: dr0ptp4kt. Dec 13 2019, 9:28 PM

Halfak added a project: Machine-Learning-Team. Dec 13 2019, 10:18 PM

Some follow-up to a conversation with @Halfak and @dr0ptp4kt :

This is how I have been adjusting the model outputs based on Wikidata properties in my Wikidata-based topic model, but the same adjustments would apply to ORES:

I create an additional topic "Compilation.List_Disambig" that includes List / Disambiguation articles. This is based on either the instance-of property (P31:Q4167410 for disambiguation pages; P31:Q13406463 for lists) or presence of the "is a list of" property (P360, which largely duplicates P31:Q13406463 but is kept for completeness).
If P625 (coordinate location) does not exist for a Wikidata item, I lower the output confidence of any geography prediction by 0.5 (effectively removing it if the threshold is 0.5). How to handle articles that are tagged with Geography WikiProjects but that most people don't think of as Geography (e.g., famous people being tagged by the WikiProject for the state they were born in) is still a big open question.
I took any item with instance-of human (P31:Q5) and moved it into a Person topic and downgraded Culture.Language and Literature. This will be no longer necessary given the in-progress changes to the taxonomy!

The actual code that I use is here: https://github.com/geohci/wikidata-topic-model/blob/master/app/app.py#L30
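
For readers who want the gist without opening the link, here is a simplified sketch of the three adjustments described above. It is not the linked code: the function name, the input shapes, the "Person" topic name, the 1.0 confidence given to added topics, and the size of the Language and literature downgrade are all assumptions made for illustration.

```python
LIST_DISAMBIG_TOPIC = "Compilation.List_Disambig"
PERSON_TOPIC = "Person"            # assumed name for the person topic
DISAMBIGUATION = "Q4167410"        # instance-of value for disambiguation pages
LIST = "Q13406463"                 # instance-of value for lists
HUMAN = "Q5"
PENALTY = 0.5                      # geography penalty from the description above;
                                   # reused for the literature downgrade (assumption)


def adjust_scores(scores, claims):
    """Adjust topic confidences using Wikidata claims.

    scores: dict of topic name -> confidence in [0, 1]
    claims: dict of property ID -> set of values present on the item
    Returns a new dict; the input is not modified.
    """
    adjusted = dict(scores)
    instance_of = claims.get("P31", set())

    # Lists and disambiguation pages get their own topic.
    if instance_of & {DISAMBIGUATION, LIST} or "P360" in claims:
        adjusted[LIST_DISAMBIG_TOPIC] = 1.0

    # No coordinates: lower every geography prediction.
    if "P625" not in claims:
        for topic in adjusted:
            if topic.startswith("Geography."):
                adjusted[topic] = max(0.0, adjusted[topic] - PENALTY)

    # Humans move to a person topic; literature is downgraded.
    if HUMAN in instance_of:
        adjusted[PERSON_TOPIC] = 1.0
        lit = "Culture.Language and literature"
        if lit in adjusted:
            adjusted[lit] = max(0.0, adjusted[lit] - PENALTY)
    return adjusted


scores = {"Geography.Europe": 0.75, "Culture.Language and literature": 0.75}
print(adjust_scores(scores, {"P31": {"Q5"}}))
```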

On a related topic, if we're considering outputting Wikidata properties as part of an ORES prediction, I'd argue for at least the following:

Instance-of (P31) to cover the List/Disambiguation use-case and provide further data to help evaluate whether an article is about a person or not
Occupation (P106) to help further break down the Biography topic
Gender (P21) to help with filtering predictions around women scientists etc. where false positives can be particularly problematic
Coordinate location (P625) to help with evaluating geography predictions.
We could potentially consider an aggregator for country too but its design is not obvious -- e.g., outputting any countries listed under country (P17) for non-people, country-of-citizenship (P27) for people, or something more complex with place of birth (P19) or place of death (P20) for people as well.

Isaac closed subtask T240273: Extract cross-wiki WikiProject tags as Resolved. Dec 17 2019, 3:15 PM

@Halfak: here are the Python regexes that my NYU master's students developed for English/Hindi/Russian, which might assist in preprocessing XML dump wikitext to model-ready tokens: https://github.com/mmarinated/topic-modeling/blob/master/baseline/data_creation/wiki_parser.py

They started with that wikifil.pl script as well for motivation and adapted it.

MMiller_WMF removed a subtask: T240276: Restructure WikiProject directory to be better. Dec 18 2019, 1:50 AM

MMiller_WMF added a subtask: T240286: Re-train English Wikipedia topic model using new WikiProject Taxonomy. Dec 18 2019, 1:53 AM

MMiller_WMF removed a subtask: T240282: Improve WikiProject template --> WikiProject mapping.

Halfak moved this task from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board. Dec 18 2019, 8:42 PM

Halfak mentioned this in T241270: Add wikidata features to topic models. Dec 20 2019, 7:17 PM

Isaac moved this task from Backlog to In Progress on the Research board. Jan 7 2020, 3:32 PM

Halfak closed subtask T240286: Re-train English Wikipedia topic model using new WikiProject Taxonomy as Resolved. Jan 13 2020, 5:56 PM

MMiller_WMF edited parent tasks, added: T240517: [EPIC] Growth: Newcomer tasks 1.1.1 (ORES topics); removed: T238608: [EPIC] Growth: Newcomer tasks 1.1.0 (topic matching). Jan 22 2020, 5:45 AM

MMiller_WMF mentioned this in T244192: Newcomer tasks: ORES ontology mapping and score thresholds. Feb 4 2020, 12:38 AM

@Isaac looks like this is a straggler task so I'm resolving. Let me know if I'm missing something.

Halfak moved this task from Parked to Completed on the Machine-Learning-Team (Active Tasks) board. Feb 25 2020, 6:28 PM

@Halfak Sounds good to me. I almost resolved it a while back but decided that I still wanted to incorporate some of Morten's suggestions into our process for grabbing the list of articles and WikiProject templates. I'll open a new task to do that work though.

Awesome. Sounds good.
