
Generate datasets for sociative (vital 10k and some WikiProjects)
Closed, Resolved · Public


Generate datasets for our collaborators to test their embedding methods against.

Get titles for all of the vital 10k articles.

Also, get titles for a couple of WikiProjects that represent different cross-sections of wiki content, e.g. African Diaspora and Women Scientists.

Event Timeline

Halfak created this task. Apr 24 2019, 3:06 PM
Restricted Application added a subscriber: Aklapper. Apr 24 2019, 3:06 PM

I just updated with a new dataset called "vital_10k_taxonomy.json". I'll be working on getting another dataset with pages that fall into a specific topic cross-section next.

Here are titles for African Diaspora (Hard mode):
Here are titles for Women Scientists (probably less hard):

That JSON is a bit mangled. I'll get a simple TSV of the titles instead.
African Diaspora:
Women Scientists:
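
For anyone reproducing the JSON-to-TSV step, here is a minimal sketch. It assumes the JSON is a nested structure of objects and lists whose string leaves are page titles; the actual layout of the datasets is not shown in this task, and the names `iter_titles` and `titles_to_tsv` are hypothetical.

```python
import csv
import io
import json


def iter_titles(node):
    """Yield every string leaf in a nested structure of dicts and lists.

    Assumption: string leaves are page titles; dict keys are ignored.
    """
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_titles(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_titles(item)


def titles_to_tsv(json_text):
    """Return a one-column TSV (header 'title') of unique titles in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    unique_titles = dict.fromkeys(iter_titles(json.loads(json_text)))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["title"])
    for title in unique_titles:
        writer.writerow([title])
    return out.getvalue()
```

Calling `titles_to_tsv` on the text of a JSON file returns the TSV as a string, which can then be written to disk. Dropping duplicates is a design choice here, on the guess that a page can appear under more than one topic.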

Halfak closed this task as Resolved. Jun 18 2019, 1:39 PM
Halfak claimed this task.