
Re-train English Wikipedia topic model using new WikiProject Taxonomy
Closed, Resolved (Public)


The goal is to have a taxonomy that more closely maps to newcomers' expectations.

We want to take the taxonomy from T240276: Restructure WikiProject directory to be better and apply it to the current pipeline, rather than using the old taxonomy that was extracted from Wikipedia.

This task is done when: The new topic model is deployed @

Event Timeline

Halfak renamed this task from Re-train English Wikipedia topic model using new WikiProject Directory to Re-train English Wikipedia topic model using new WikiProject Taxonomy. Dec 9 2019, 10:30 PM

Using our old vectors, it looks like we're getting decent fitness. I've trained models on article text (most recent revision) and draft text (first revision) and I'm seeing similar fitness. There are a few topics that don't get good fitness.

The worst topics in the article topic model (PR-AUC):

Geography.Regions.Africa.Central Africa      0
Culture.Media.Software                       0.108
STEM.Mathematics                             0.123
Geography.Regions.Asia.North Asia            0.189
STEM.Physics                                 0.204
STEM.Computing                               0.214
History and Society.Society                  0.227
History and Society.History                  0.229
STEM.Technology                              0.251
Geography.Regions.Africa.Northern Africa     0.262
Geography.Regions.Africa.Eastern Africa      0.265
Culture.Media.Entertainment                  0.28

The worst topics in the draft topic model (PR-AUC):

Geography.Regions.Africa.Central Africa      0
Culture.Media.Software                       0.103
STEM.Physics                                 0.195
STEM.Computing                               0.203
Geography.Regions.Asia.North Asia            0.206
STEM.Mathematics                             0.218
History and Society.Society                  0.229
STEM.Technology                              0.236
Geography.Regions.Africa.Eastern Africa      0.237
History and Society.History                  0.25
STEM.Libraries & Information                 0.269
Culture.Media.Entertainment                  0.278
Geography.Regions.Africa.Northern Africa     0.296

There's a lot of overlap here. So I don't think it's a lack of text content that accounts for the badness of these cases. Let's see what we're training on.

Geography.Regions.Africa.Central Africa (A lack of labeled data)

$ bzcat datasets/enwiki.labeled_article_items.json.bz2 | grep "Geography.Regions.Africa.Central Africa" | json2tsv sitelinks.en | shuf | head
Marc Mbombo
Likati District
Democratic Republic of the Congo–Holy See relations
Rund Kanika
Red Cross of the Democratic Republic of the Congo
Bruno Tshibala
Tshibalabala Kadima
Kinshasa Democratic Republic of the Congo Temple
Lever Brothers

It looks like these are consistent, but we only have 18 observations total! It looks like the relevant WikiProjects aren't getting picked up.
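A minimal sketch of how the per-topic observation counts could be tallied, to spot under-labeled topics like this one. It assumes the dataset is one JSON object per line and that the labels live in a list field; the field name `taxo_labels` is a hypothetical placeholder, not confirmed by this task.

```python
import json
from collections import Counter


def count_labels(lines, label_field="taxo_labels"):
    """Count how many labeled items carry each topic label.

    lines: any iterable of JSON strings, one item per line.
    label_field: hypothetical field name; adjust to the real dataset.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        item = json.loads(line)
        for label in item.get(label_field, []):
            counts[label] += 1
    return counts


# Toy input standing in for the decompressed dataset.
toy_lines = [
    '{"taxo_labels": ["STEM.Physics", "STEM.Mathematics"]}',
    '{"taxo_labels": ["STEM.Physics"]}',
]
# Print the rarest labels first.
for label, n in sorted(count_labels(toy_lines).items(), key=lambda kv: kv[1]):
    print(n, label)
```

For the real file, the same function could be fed from `bz2.open(path, "rt", encoding="utf-8")`, since it only needs an iterable of lines.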

We have the following WikiProjects. All of them are sub-projects of WikiProject Africa and do not use their own template. The rare exception is the small set of articles that use {{WikiProject Democratic Republic of the Congo}} which is a redirect to the {{WikiProject Africa}} template.

- WikiProject Angola
- WikiProject Cameroon 
- WikiProject Central African Republic
- WikiProject Chad
- WikiProject Democratic Republic of the Congo
- WikiProject Equatorial Guinea
- WikiProject Republic of the Congo
- WikiProject Gabon
- WikiProject Sao Tome and Principe

Culture.Media.Software (Category too wide?)

Random sample of articles:

Microsoft Windows version history
Editor war
Common Language Infrastructure
Grace Hopper

At a glance, these all seem relevant to software except maybe "Editor war". Maybe this category is just too wide.

STEM.Mathematics (Maybe exclude Crypto?)

Random sample of articles:

Rock paper scissors
Mathematical game
List of algorithms
Probability distribution
Pretty Good Privacy
Elliptic-curve cryptography
Normal distribution
Integer factorization

This looks pretty good. There's a lot of probability math and algorithms stuff in there. OpenSSH and PGP don't seem like real "math" articles, but I would expect that they get pulled in through WikiProject Cryptography. I wonder if we should exclude that WikiProject. Anyway, it's weird that we aren't really learning what "Mathematics" includes, since the sample does look like a consistent set.

Geography.Regions.Asia.North Asia (Varied tagging practices)

Random sample of tagged articles:

North Ossetia–Alania
Boris Delibash
Lapland War
Grigory Zinoviev

It looks like there's a varied set of things that WikiProject Russia cares about that are not geographical in nature. E.g., the SVT-40 is a gun and Boris Delibash is a person.

STEM.Physics (Jargon isn't handled well)

Steven Weinberg
Inertial frame of reference
Chung-Yao Chao
Age of the Earth
Figure of the Earth
Sachs–Wolfe effect
Theory of relativity
Euclidean vector
Jens Martin Knudsen

Well, "Age of the Earth" is tagged by just about every WikiProject.

I wonder if this has something to do with jargon being missing from the vectors. I dug into the vectors we have and learned a few things.

  1. Capitalization matters. "Universe" is different from "universe".
  2. Specialized terms like "redshifted" don't exist.
  3. We use a vector of zeros (e.g. [0, 0, 0, ... 0]) whenever a word isn't covered by the vectors.

#3 is really bad because it means that articles that have words that aren't in the vectors get a lot of zero vectors averaged in. We should probably just not emit a vector if we can't look one up for a word!
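A minimal sketch of why the zero vectors hurt, assuming the article representation is a plain mean of word vectors (the task implies averaging but doesn't show the pipeline code). The function name, the toy vectors, and the `skip_missing` flag are all illustrative.

```python
def mean_vector(words, vectors, dim, skip_missing=True):
    """Average the vectors of `words`.

    vectors: dict mapping word -> list of floats of length `dim`.
    skip_missing=False mimics substituting a zero vector for unknown
    words; skip_missing=True leaves unknown words out of the mean.
    """
    total = [0.0] * dim
    count = 0
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            if skip_missing:
                continue
            vec = [0.0] * dim  # zero vector for an unknown word
        for i, value in enumerate(vec):
            total[i] += value
        count += 1
    if count == 0:
        return [0.0] * dim  # nothing could be looked up
    return [value / count for value in total]


toy_vectors = {"universe": [1.0, 0.0], "galaxy": [0.0, 1.0]}
# "redshifted" is missing; "Universe" misses because lookups are case-sensitive.
words = ["universe", "galaxy", "redshifted", "Universe"]

print(mean_vector(words, toy_vectors, 2, skip_missing=False))  # [0.25, 0.25]
print(mean_vector(words, toy_vectors, 2, skip_missing=True))   # [0.5, 0.5]
```

With half the words unknown, the zero-filled mean shrinks toward the origin, so jargon-heavy articles end up with weaker feature vectors than articles written in common vocabulary.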

I solved the zero vectors issue and retrained the models. Here's the difference:

                                             Old    New    Diff
Geography.Regions.Africa.Central Africa      0      0      0
Culture.Media.Software                       0.108  0.145  ++
STEM.Mathematics                             0.123  0.161  ++
Geography.Regions.Asia.North Asia            0.189  0.18   -
STEM.Computing                               0.214  0.19   --
STEM.Physics                                 0.204  0.232  ++
History and Society.Society                  0.227  0.233  +
History and Society.History                  0.229  0.237  +
Geography.Regions.Africa.Eastern Africa      0.265  0.247  --
STEM.Technology                              0.251  0.248  -
Culture.Media.Entertainment                  0.28   0.26   --
Geography.Regions.Africa.Northern Africa     0.262  0.267  +

It looks like we didn't really see any meaningful benefit. Bummer. I still have hope though that we'll be able to get better signal from the fasttext vectors that @kevinbazira is working on because they'll have vectors for deep jargon like "redshift" and "flourine".

We'll likely be able to deal with some of the geographic misses by pulling in geotag info.

I did some digging into the 50 cell models that Kevin trained. It looks like we do pick up "redshift" in the first 300k words:

[('redshifts', 0.967665433883667), ('redshifted', 0.9275557994842529), ('quasars', 0.9235901832580566), ('microlensing', 0.916486918926239), ('grbs', 0.9055891036987305), ('lensing', 0.8869045972824097), ('pulsars', 0.8805599808692932), ('gravitational', 0.8784142732620239), ('extrasolar', 0.8742184042930603), ('photometry', 0.873486340045929)]

But it doesn't look like we pick up "flourine"

>>> kv.similar_by_word('flourine')
KeyError: "word 'flourine' not in vocabulary"

I couldn't even get "flourine" with 500k words but I could get it if I loaded in the *entire model*.

>>> kv.similar_by_word('flourine')
[('crusts', 0.8145809769630432), ('triturated', 0.8126140832901001), ('carbonating', 0.8122193813323975), ('glaze', 0.8105583190917969), ('remineralized', 0.805204451084137), ('kaolin', 0.8033907413482666), ('flouring', 0.8021371960639954), ('bleaches', 0.7986146211624146), ('mordanted', 0.7982538342475891), ('macerated', 0.7974271774291992)]

This looks like an alright set of words.
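The effect of the vocabulary cut-offs above (300k, 500k, entire model) can be sketched as follows. This assumes a word2vec-style text file with one "word v1 v2 ..." entry per line, no header line, sorted from most to least frequent; the loader below is a toy stand-in, not the code used in this task. In gensim, `KeyedVectors.load_word2vec_format` accepts a `limit` argument with the same effect, though the task doesn't say how the cut-offs were actually applied.

```python
def load_text_vectors(lines, limit=None):
    """Parse "word v1 v2 ..." lines into a dict of word -> vector.

    Only the first `limit` entries are kept. With a frequency-sorted
    file, a limit therefore drops the rarest words.
    """
    vectors = {}
    for line in lines:
        if limit is not None and len(vectors) >= limit:
            break
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Toy file: the rare word sits at the end of the frequency-sorted list.
toy_file = [
    "the 0.1 0.2",
    "redshift 0.3 0.4",
    "flourine 0.5 0.6",
]
top2 = load_text_vectors(toy_file, limit=2)
full = load_text_vectors(toy_file)
print("flourine" in top2, "flourine" in full)  # False True
```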

Let's try the 100 cell models!

>>> kv.similar_by_word('redshift')
[('redshifts', 0.9580318927764893), ('redshifted', 0.9244788289070129), ('quasars', 0.8797826766967773), ('microlensing', 0.856326699256897), ('parsecs', 0.8424521684646606), ('kiloparsecs', 0.8395386934280396), ('grbs', 0.8378009796142578), ('pulsars', 0.8366929292678833), ('gravitational', 0.8292627930641174), ('bolometric', 0.8287577033042908)]

That's a pretty similar result. Let's try something a bit harder.

>>> kv.similar_by_word('permutations')
[('permutation', 0.959696352481842), ('permuting', 0.9091914892196655), ('integers', 0.8868318796157837), ('multiplications', 0.8791753053665161), ('multisets', 0.8739590048789978), ('vertices', 0.8692960739135742), ('permute', 0.8651570081710815), ('multiplicities', 0.8635972738265991), ('automorphisms', 0.8589743375778198), ('tuples', 0.8580431938171387)]
>>> kv.similar_by_word('combinatorics')
[('combinatorial', 0.9255411624908447), ('algebraic', 0.8925043344497681), ('mathematical', 0.8895950317382812), ('mathematic', 0.8747307062149048), ('computability', 0.8731629848480225), ('theorems', 0.8695327043533325), ('combinatory', 0.8691357374191284), ('theoretic', 0.8615963459014893), ('diophantine', 0.8591476678848267), ('metamathematics', 0.8570147752761841)]

Looks like we're getting good signal there.
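For reading the scores above: assuming `kv` is a gensim `KeyedVectors` object (the task doesn't say so explicitly), `similar_by_word` ranks the other words by cosine similarity to the query word. A toy stand-in with made-up two-dimensional vectors:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_similar_by_word(word, vectors, topn=10):
    """Return the `topn` words most similar to `word`, best first."""
    if word not in vectors:
        raise KeyError(f"word '{word}' not in vocabulary")
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up vectors, for illustration only.
toy = {"redshift": [1.0, 0.0], "redshifts": [0.9, 0.1], "glaze": [0.0, 1.0]}
print(toy_similar_by_word("redshift", toy))
```

A score near 1 means the two words point in almost the same direction in the embedding space, which is why near-synonyms and inflections top the lists above.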

I wonder how our old vectors performed.

>>> kv.similar_by_word('redshift')
KeyError: "word 'redshift' not in vocabulary"
>>> kv.similar_by_word('factorial')
KeyError: "word 'factorial' not in vocabulary"
>>> kv.similar_by_word('permutations')
[('combinations', 0.5920488238334656), ('variables', 0.4811089038848877), ('possibilities', 0.47414517402648926), ('variations', 0.4657440185546875), ('scenarios', 0.46018773317337036), ('equations', 0.43501317501068115), ('twists', 0.4297747313976288), ('guises', 0.427018404006958), ('paradoxes', 0.4257811903953552), ('ifs', 0.41084015369415283)]
>>> kv.similar_by_word('combinatorics')
KeyError: "word 'combinatorics' not in vocabulary"

Aha! It looks like our current model is pretty bad. That's good news. Seems like we can get a lot of signal from the new vectors.
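The spot checks above amount to a vocabulary coverage test. A minimal sketch, using toy vocabularies that only mirror the lookups shown in this thread; the helper name and the vocabulary sets are illustrative, not taken from the real models.

```python
def coverage(probe_terms, vocabulary):
    """Return (fraction of probe terms found, list of missing terms)."""
    missing = [term for term in probe_terms if term not in vocabulary]
    if not probe_terms:
        return 0.0, missing
    found = len(probe_terms) - len(missing)
    return found / len(probe_terms), missing


probes = ["redshift", "permutations", "combinatorics"]

# Toy vocabularies reflecting only the lookups shown above.
old_vocab = {"permutations", "combinations", "variables"}
new_vocab = {"redshift", "redshifts", "permutations", "combinatorics"}

print(coverage(probes, old_vocab))
print(coverage(probes, new_vocab))
```

Running the same probe list against each candidate set of vectors gives a quick, if crude, way to compare how much domain jargon they cover.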

Regrettably, I was not able to get better signal from the new vectors. I'm not quite sure why. For this work, I looked at the aggregate PR_AUC measure, but the fitness for the topics I reference above correlates pretty strongly with the overall PR_AUC averages.

  • word2vec: gnews 300 cell
    • pr_auc (micro=0.709, macro=0.548)
  • fasttext: wikipedia 50 cell skipgram (1.9GB model with a ton of estimators and depth)
    • pr_auc (micro=0.617, macro=0.435)
  • fasttext: wikipedia 50 cell cbow
    • pr_auc (micro=0.545, macro=0.351)
  • fasttext: wikipedia 100 cell skipgram
    • pr_auc (micro=0.588, macro=0.407)
  • fasttext: wikipedia 200 cell skipgram
    • pr_auc (micro=0.598, macro=0.421)

Generally, the word2vec-based vectors from Google News perform better than the fasttext vectors. I tried some hyperparameter optimization, and the model gets better fitness with more estimators and more depth, but it also gets huge. See the note for the "wikipedia 50 cell skipgram".
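On reading the micro and macro numbers: a sketch of how the two averages differ, using step-wise average precision as a stand-in for PR-AUC. The exact computation behind the figures above may differ, and the scores below are made up. Macro averages the per-topic values, so a poorly fit rare topic counts as much as a common one; micro pools every (score, label) pair first.

```python
def average_precision(scored):
    """Step-wise area under the precision-recall curve.

    scored: list of (score, is_positive) pairs for one topic.
    """
    positives = sum(1 for _, is_pos in scored if is_pos)
    if positives == 0:
        return 0.0
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / positives


def macro_micro(per_topic):
    """per_topic: dict of topic -> list of (score, is_positive) pairs."""
    macro = sum(average_precision(s) for s in per_topic.values()) / len(per_topic)
    pooled = [pair for scored in per_topic.values() for pair in scored]
    micro = average_precision(pooled)
    return macro, micro


# Made-up scores: one well-fit common topic, one badly fit rare topic.
toy = {
    "common": [(0.9, True), (0.8, True), (0.7, True), (0.2, False)],
    "rare": [(0.85, False), (0.75, False), (0.65, False), (0.1, True)],
}
macro, micro = macro_micro(toy)
print(f"macro={macro:.3f} micro={micro:.3f}")
```

In this toy case the macro value comes out lower than the micro value, the same direction as every row in the list above, which is what you'd expect when the weakest topics are also the ones with the fewest labeled examples.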

So, my plan for now is to deploy the gnews 300 cell model ASAP and continue explorations here longer-term.