Investigate and document the Phoenix project to understand what it could mean for our work on sections, wikilink QIDs, and entity types.
Phoenix:
- is an experimental WMF service that demonstrates the value of a structured content store
- chunks Wikipedia pages into sections, collects their wikilinks, and connects them to QIDs
- uses Rosette (third-party ML) to predict entity types based on QIDs
- GitHub: https://github.com/wikimedia/phoenix
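As a baseline for the comparisons in this task, here is a minimal sketch of the pipeline described above: split wikitext into sections, collect the wikilinks in each section, and map link targets to QIDs. It is not taken from the Phoenix code base. The names `Section`, `chunk_sections`, `normalize_title`, and `link_sections` are hypothetical, the regexes cover only simple headings and links, and the title-to-QID mapping is assumed to be a plain in-memory dict rather than a real Wikidata lookup.

```python
import re
from dataclasses import dataclass, field

# Simplified patterns: "== Heading ==" lines and [[Target#anchor|label]] links.
# Templates, nested links, and namespace prefixes (File:, Category:) are not handled.
HEADING_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
WIKILINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


@dataclass
class Section:
    title: str
    text: str = ""
    links: list = field(default_factory=list)  # (page title, QID or None)


def chunk_sections(wikitext, lead_title="Lead"):
    """Split wikitext into sections; text before the first heading is the lead."""
    sections = [Section(lead_title)]
    for line in wikitext.splitlines():
        match = HEADING_RE.match(line)
        if match:
            sections.append(Section(match.group(2)))
        else:
            sections[-1].text += line + "\n"
    return sections


def normalize_title(target):
    """Approximate MediaWiki title normalization: underscores, first letter."""
    target = target.strip().replace("_", " ")
    return target[:1].upper() + target[1:]


def link_sections(wikitext, title_to_qid):
    """Attach (title, QID) pairs to each section; unknown titles map to None."""
    sections = chunk_sections(wikitext)
    for section in sections:
        for match in WIKILINK_RE.finditer(section.text):
            title = normalize_title(match.group(1))
            section.links.append((title, title_to_qid.get(title)))
    return sections


# Sample data for illustration only.
sample = (
    "Intro with [[Douglas Adams|the author]].\n"
    "== Career ==\n"
    "He lived in [[london#History]] and [[Unknown page]].\n"
)
qids = {"Douglas Adams": "Q42", "London": "Q84"}
result = link_sections(sample, qids)
```

Running the same sample pages through our parsers and through Phoenix, then diffing the per-section link lists, would be one way to approach the comparison tasks.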
Tasks:
[X] We already do section parsing
[] Compare their sections to our sections
[] Compare their link parser to ours
[] Evaluate how they convert links into QIDs
[] Evaluate Rosette's entity types (https://www.rosette.com/capability/entity-extractor/#tech-specs): look into the free API and its results, and evaluate what it would mean to run something similar in-house.
Resources:
- Code to import Rosette: https://github.com/wikimedia/phoenix/tree/master/import
Deliverable:
A short report on the features of the Phoenix project and of Rosette, covering the pros and cons of their approach and whether we can adopt some or all of their features.