Generate datasets for sociative (vital 10k and some wikiprojects)
Closed, Resolved · Public

Description

Generate datasets for our collaborators to test their embedding methods against.

Get titles for all of the vital 10k articles.

Also, get titles for a couple of WikiProjects that represent different cross-sections of Wiki content. E.g. African diaspora and Women scientists.

Event Timeline

Halfak created this task. Apr 24 2019, 3:06 PM
Restricted Application added a subscriber: Aklapper. Apr 24 2019, 3:06 PM

I just updated https://github.com/halfak/taxonomy_examples with a new dataset called "vital_10k_taxonomy.json". Next, I'll work on getting another dataset with pages that fall into a specific topic cross-section.

Here are titles for African Diaspora (Hard mode): https://quarry.wmflabs.org/run/366495/output/0/json
Here are titles for Women Scientists (Probably less hard): https://quarry.wmflabs.org/run/366489/output/0/json

That JSON is a bit mangled. I'll get a simple TSV of the titles instead.
African Diaspora: https://quarry.wmflabs.org/run/366495/output/0/tsv
Women Scientists: https://quarry.wmflabs.org/run/366489/output/0/tsv
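For anyone loading these TSV exports, here is a minimal sketch of one way to read the titles in Python. It assumes the export has a single header row and that the page title is in the first column, with underscores in place of spaces as MediaWiki stores them; neither detail is confirmed in this task. The function names `parse_titles_tsv` and `fetch_titles` and the sample rows are made up for illustration.

```python
import csv
import io
import urllib.request


def parse_titles_tsv(text, has_header=True):
    """Return page titles from TSV text, taking the first column of each row.

    Assumption: one title per row, optionally preceded by a header row.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Skip blank lines and rows whose first column is empty.
    rows = [row for row in reader if row and row[0].strip()]
    if has_header and rows:
        rows = rows[1:]
    # Assumption: titles use underscores; convert them to display form.
    return [row[0].strip().replace("_", " ") for row in rows]


def fetch_titles(url):
    """Download a TSV export and parse it (requires network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_titles_tsv(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Illustrative sample rows only, not taken from the actual datasets.
    sample = "page_title\nMarie_Curie\nAda_Lovelace\n"
    print(parse_titles_tsv(sample))
```

Passing one of the TSV URLs above to `fetch_titles` should return a plain list of titles, provided the export matches the assumed layout. If the files turn out to have no header row, call `parse_titles_tsv(text, has_header=False)` instead.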

Halfak closed this task as Resolved. Jun 18 2019, 1:39 PM
Halfak claimed this task.