
Investigate features of Phoenix Project - timebox:3 days
Closed, Resolved · Public · 5 Estimated Story Points

Description

Investigate and document the Phoenix project to understand what it could mean in light of our Sections, wikilink-to-QID, and Entity Type work.

Phoenix is

  • a WMF experimental service demonstrating the value of a structured content store
  • chunking Wikipedia pages into sections, collecting wikilinks, and connecting them to QIDs
  • using Rosette (3rd-party ML) to predict entity types based on QIDs
  • GitHub: https://github.com/wikimedia/phoenix

Tasks

  • We already do section parsing
  • Compare their sections to our sections
  • Compare their link parser to ours
  • Evaluate how they convert links into QIDs
  • Evaluate Rosette's entity types: https://www.rosette.com/capability/entity-extractor/#tech-specs; look into the free API and its results. Evaluate what it would mean to have something similar running in-house.

Resources:

Deliverable:
A short report on the features of the Phoenix project as well as Rosette: pros and cons of their approach and whether we can adopt some or all of their features.


Event Timeline

The requirement is to find all internal wikilinks on the page, then parse the child pages and extract all the Wikidata links (probably more appropriate to extract just the right-panel Wikidata links). Associate the child QID URLs with the wikilinks on the original page.
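A rough sketch of that requirement using the public MediaWiki Action API (the wiki endpoint, the source page title, and the single-batch handling are illustrative assumptions, not how Phoenix does it): fetch a page's internal links, then look up each child page's `wikibase_item` page prop to get its QID.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def internal_links(title):
    """Return the internal (main-namespace) wikilink targets of a page (first batch only, for brevity)."""
    params = {
        "action": "query", "format": "json", "prop": "links",
        "titles": title, "plnamespace": 0, "pllimit": "max",
    }
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    return [link["title"] for link in page.get("links", [])]

def qids_for(titles):
    """Map page titles to their Wikidata QIDs via the wikibase_item page prop."""
    params = {
        "action": "query", "format": "json", "prop": "pageprops",
        "ppprop": "wikibase_item", "titles": "|".join(titles[:50]),  # API batch limit
    }
    pages = requests.get(API, params=params).json()["query"]["pages"].values()
    return {p["title"]: p.get("pageprops", {}).get("wikibase_item") for p in pages}

links = internal_links("Phoenix (mythology)")  # hypothetical source page
print(qids_for(links))                         # e.g. {"Greek mythology": "Q...", ...}
```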

JArguello-WMF renamed this task from Investigate features of Phoenix Project to Investigate features of Phoenix Project - timebox:3 days. Oct 11 2023, 2:30 PM

If we aim to output a single "Entity Type" for an article, then Rosette will not help us.

Rosette takes a chunk of text and annotates it with part-of-speech (POS) tags and a set of possible QIDs that may be related to each noun in the text chunk. So it creates many candidate QIDs for each word; it does not reduce this to a summary with one QID for the chunk of text, and it makes no inference about the overall meaning of the chunk.

The aim of the Phoenix project was to decouple sections/paragraphs from an article page and experiment with connecting these disconnected chunks using QIDs. Phoenix created a list of topic QIDs for each paragraph and saved them in Elasticsearch as index keys. Phoenix does not reduce the QIDs into a smaller set; it uses Elasticsearch to find "similar paragraphs" based on Elasticsearch-ranked results. The Phoenix repo does not save the Elasticsearch config or the field types they used for indexing, so I assume they used a default index. They could have benefited from a k-nearest-neighbour algorithm for paragraph similarity matching on QIDs; instead, it seems to be a simple single-QID match.
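Since the repo does not record the mapping, the sketch below assumes each paragraph is indexed as a document with a keyword field listing its QIDs (the index name `phoenix-paragraphs` and the field name `qids` are my assumptions). A "similar paragraphs" lookup on QID overlap could then be roughly:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local dev cluster

# Assumed document shape:
# {"paragraph_id": "...", "text": "...", "qids": ["Q42", "Q5", ...]}
# with "qids" mapped as a keyword field.

def similar_paragraphs(qids, size=10):
    """Rank paragraphs by QID overlap: each matching QID contributes to the score,
    so paragraphs sharing more QIDs rank higher (no k-NN, just term matching)."""
    resp = es.search(
        index="phoenix-paragraphs",  # hypothetical index name
        query={"bool": {"should": [{"term": {"qids": q}} for q in qids]}},
        size=size,
    )
    return [(hit["_score"], hit["_source"]["paragraph_id"]) for hit in resp["hits"]["hits"]]

print(similar_paragraphs(["Q42", "Q5"]))
```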

Takeaways from the Phoenix project: they decomposed articles into sections/paragraphs. They saved many QIDs for each paragraph into Elasticsearch. They used GraphQL to "standardise" the access API for their clients. They did not do any extra logic to summarise QIDs into a single Entity Type.

Phoenix used an API (Rosette.com) to send text and get back POS tags and candidate QIDs for each noun. Rosette is a commercial API and our preference is to use open-source tools. As an alternative to Rosette, there is the spaCy Python library, which does the same and is highly regarded in the NLP community. I'd recommend spaCy over Rosette for future work on NLP, NER, POS tagging and QID entity linking (see the code and output example in the code block at https://spacy.io/usage/linguistic-features#entity-linking). It has a default training model that was trained on WMF articles and can be fine-tuned by training it on other text. Also, it has a sibling project called "Prodigy" (https://demo.prodi.gy/?=null&view_id=ner_manual) that allows humans to build and edit knowledge graphs (similar to Wikidata, but on a smaller scale). Prodigy allows for human intervention and reinforcement learning to improve QIDs and their relationship to keywords.
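For comparison with Rosette's output, here is a minimal spaCy sketch of POS tagging plus named-entity spans. Linking entities to QIDs requires an extra `entity_linker` component backed by a knowledge base, which is not shown; `en_core_web_sm` is just the smallest off-the-shelf pipeline and the example sentence is a placeholder.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Ada Lovelace worked with Charles Babbage on the Analytical Engine in London.")

# Part-of-speech tags for every token (roughly what Rosette returns per word)
for token in doc:
    print(token.text, token.pos_, token.tag_)

# Named-entity spans; ent.kb_id_ stays empty unless an entity_linker
# component with a Wikidata knowledge base has been added to the pipeline
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_ or "<no QID - needs entity_linker>")
```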

Of relevance for Entity Type tagging of WMF articles is this demo: https://demo.prodi.gy/?=null&view_id=textcat_multi. It allows editors to label a document using RLHF (reinforcement learning from human feedback).

Also, BERTopic will give us the main semantic topics for an article; we can then generate QIDs from these topic keywords using the spaCy entity linker: https://github.com/maartengr/bertopic
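A rough sketch of that step (the corpus below is a placeholder; BERTopic needs a reasonably large set of documents to produce stable topics, and the topic-keyword-to-QID step would be handled separately by the entity linker):

```python
from bertopic import BERTopic

# Placeholder corpus: in practice these would be article sections/paragraphs
docs = [
    "The phoenix is a legendary bird that cyclically regenerates.",
    "Elasticsearch is a search engine built on Apache Lucene.",
    # ... many more documents
]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)  # one topic id per document

# Top keywords (with scores) for the topic assigned to the first document;
# these keywords would then be fed to an entity linker to obtain candidate QIDs
print(topic_model.get_topic(topics[0]))
```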

Potential classification system using spaCy: https://colab.research.google.com/github/wandb/examples/blob/master/colabs/spacy/SpaCy_v3_and_W%26B.ipynb#scrollTo=krVWm1YRFbHc

We'd need a training and validation dataset to create a model with our curated entity type set.
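As a sketch of what that could look like with spaCy's `textcat_multilabel` component (the labels and labelled articles below are placeholders for our curated entity type set, not a real dataset):

```python
import random
import spacy
from spacy.training import Example

# Placeholder entity types and labelled articles; a real dataset would be curated by editors
LABELS = ["human", "place", "event", "creative work"]
data = [
    ("Ada Lovelace was an English mathematician and writer.",
     {"human": 1.0, "place": 0.0, "event": 0.0, "creative work": 0.0}),
    ("Paris is the capital and largest city of France.",
     {"human": 0.0, "place": 1.0, "event": 0.0, "creative work": 0.0}),
    # ... many more labelled articles
]

# Simple 80/20 train/validation split
random.shuffle(data)
split = int(0.8 * len(data))
train_data, dev_data = data[:split], data[split:]

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in LABELS:
    textcat.add_label(label)

train_examples = [Example.from_dict(nlp.make_doc(text), {"cats": cats}) for text, cats in train_data]
optimizer = nlp.initialize(lambda: train_examples)

for epoch in range(10):
    losses = {}
    random.shuffle(train_examples)
    for example in train_examples:
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)

# Validation: score the held-out articles
dev_examples = [Example.from_dict(nlp.make_doc(text), {"cats": cats}) for text, cats in dev_data]
print(nlp.evaluate(dev_examples))
```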