
Investigate features of Phoenix Project - timebox:3 days
Closed, Resolved · Public · 5 Estimated Story Points

Description

Investigate and document the Phoenix project to understand what it could mean in light of our Sections, wikilink-to-QID, and Entity Type work.

Phoenix is

  • a WMF experimental service demonstrating the value of a structured content store
  • chunking Wikipedia pages into sections, collecting wikilinks, and connecting them to QIDs
  • using Rosette (3rd-party ML) to predict entity types based on QIDs
  • GitHub: https://github.com/wikimedia/phoenix

Tasks

  • We already do section parsing
  • Compare their sections to our sections
  • Compare their link parser to ours
  • Evaluate how they convert links into QIDs
  • Evaluate Rosette's entity types: https://www.rosette.com/capability/entity-extractor/#tech-specs; look into the free API and its results. Evaluate what it would mean to have something similar running in-house.

Resources:

Deliverable:
A short report on the features of the Phoenix project as well as Rosette: pros and cons of their approach and whether we can adopt some or all of their features.


Event Timeline

The requirement is to find all internal wikilinks on the page, then parse the child pages and extract all the Wikidata links (probably more appropriate to extract just the right-panel Wikidata links). Associate the child QID URLs with the wikilinks on the original page.
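A rough sketch of that requirement using the public MediaWiki Action API (the wiki endpoint, the source page title, and the single-batch handling are illustrative assumptions, not how Phoenix does it): fetch a page's internal links, then look up each child page's `wikibase_item` page prop to get its QID.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"  # assumption: English Wikipedia

def internal_links(title):
    """Return the internal (main-namespace) wikilink targets of a page (first batch only, for brevity)."""
    params = {
        "action": "query", "format": "json", "prop": "links",
        "titles": title, "plnamespace": 0, "pllimit": "max",
    }
    page = next(iter(requests.get(API, params=params).json()["query"]["pages"].values()))
    return [link["title"] for link in page.get("links", [])]

def qids_for(titles):
    """Map page titles to their Wikidata QIDs via the wikibase_item page prop."""
    params = {
        "action": "query", "format": "json", "prop": "pageprops",
        "ppprop": "wikibase_item", "titles": "|".join(titles[:50]),  # API batch limit
    }
    pages = requests.get(API, params=params).json()["query"]["pages"].values()
    return {p["title"]: p.get("pageprops", {}).get("wikibase_item") for p in pages}

links = internal_links("Phoenix (mythology)")  # hypothetical source page
print(qids_for(links))                         # e.g. {"Greek mythology": "Q...", ...}
```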

JArguello-WMF renamed this task from Investigate features of Phoenix Project to Investigate features of Phoenix Project - timebox:3 days. Oct 11 2023, 2:30 PM

If we aim to output a single "Entity Type" for an article, then Rosette will not help us.

Rosette takes a chunk of text and annotates it with part-of-speech (POS) tags and a set of possible QIDs that may be related to each noun in the text chunk. So it creates many candidate QIDs for each word; it does not reduce this to a summary with one QID for the chunk of text, and it makes no inference about the overall meaning of the chunk.

The aim of the Phoenix project was to decouple sections/paragraphs from an article page and experiment with connecting these disconnected chunks using QIDs. Phoenix created a list of topic QIDs for each paragraph and saved them in Elasticsearch as index keys. Phoenix does not reduce the QIDs into a smaller set; it uses Elasticsearch to find "similar paragraphs" based on Elasticsearch-ranked results. The Phoenix repo does not save the Elasticsearch config or the field types they used for indexing, so I assume they used a default index. They could have benefited from a k-nearest-neighbour algorithm for paragraph similarity matching on QIDs; instead, it seems to be a simple single-QID match.
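Since the repo does not record the mapping, the sketch below assumes each paragraph is indexed as a document with a keyword field listing its QIDs (the index name `phoenix-paragraphs` and the field name `qids` are my assumptions). A "similar paragraphs" lookup on QID overlap could then be roughly:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local dev cluster

# Assumed document shape:
# {"paragraph_id": "...", "text": "...", "qids": ["Q42", "Q5", ...]}
# with "qids" mapped as a keyword field.

def similar_paragraphs(qids, size=10):
    """Rank paragraphs by QID overlap: each matching QID contributes to the score,
    so paragraphs sharing more QIDs rank higher (no k-NN, just term matching)."""
    resp = es.search(
        index="phoenix-paragraphs",  # hypothetical index name
        query={"bool": {"should": [{"term": {"qids": q}} for q in qids]}},
        size=size,
    )
    return [(hit["_score"], hit["_source"]["paragraph_id"]) for hit in resp["hits"]["hits"]]

print(similar_paragraphs(["Q42", "Q5"]))
```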

Takeaways from the Phoenix project: they decomposed articles into sections/paragraphs. They saved many QIDs for each paragraph into Elasticsearch. They used GraphQL to "standardise" the access API for their clients. They did not do any extra logic to summarise QIDs into a single Entity Type.

Phoenix used an API (Rosette.com) to send text and get back POS tags and candidate QIDs for each noun. Rosette is a commercial API and our preference is to use open-source tools. As an alternative to Rosette, there is the spaCy Python library, which does the same and is highly regarded in the NLP community. I'd recommend spaCy over Rosette for future work on NLP, NER, POS tagging and QID entity linking (see the code and output example in the code block at https://spacy.io/usage/linguistic-features#entity-linking). It has a default training model that was trained on WMF articles and can be fine-tuned by training it on other text. Also, it has a sibling project called "Prodigy" (https://demo.prodi.gy/?=null&view_id=ner_manual) that allows humans to build and edit knowledge graphs (similar to Wikidata, but on a smaller scale). Prodigy allows for human intervention and reinforcement learning to improve QIDs and their relationship to keywords.
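For comparison with Rosette's output, here is a minimal spaCy sketch of POS tagging plus named-entity spans. Linking entities to QIDs requires an extra `entity_linker` component backed by a knowledge base, which is not shown; `en_core_web_sm` is just the smallest off-the-shelf pipeline and the example sentence is a placeholder.

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Ada Lovelace worked with Charles Babbage on the Analytical Engine in London.")

# Part-of-speech tags for every token (roughly what Rosette returns per word)
for token in doc:
    print(token.text, token.pos_, token.tag_)

# Named-entity spans; ent.kb_id_ stays empty unless an entity_linker
# component with a Wikidata knowledge base has been added to the pipeline
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_ or "<no QID - needs entity_linker>")
```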

Of relevance for Entity Type tagging of WMF articles is this demo: https://demo.prodi.gy/?=null&view_id=textcat_multi. It allows editors to label a document using RLHF (reinforcement learning from human feedback).

Also, BERTopic will give us the main semantic topics for an article; we can then generate QIDs from these topic keywords using the spaCy entity linker: https://github.com/maartengr/bertopic
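A rough sketch of that step (the corpus below is a placeholder; BERTopic needs a reasonably large set of documents to produce stable topics, and the topic-keyword-to-QID step would be handled separately by the entity linker):

```python
from bertopic import BERTopic

# Placeholder corpus: in practice these would be article sections/paragraphs
docs = [
    "The phoenix is a legendary bird that cyclically regenerates.",
    "Elasticsearch is a search engine built on Apache Lucene.",
    # ... many more documents
]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)  # one topic id per document

# Top keywords (with scores) for the topic assigned to the first document;
# these keywords would then be fed to an entity linker to obtain candidate QIDs
print(topic_model.get_topic(topics[0]))
```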

Potential classification system using spaCy: https://colab.research.google.com/github/wandb/examples/blob/master/colabs/spacy/SpaCy_v3_and_W%26B.ipynb#scrollTo=krVWm1YRFbHc

We'd need a training and validation dataset to create a model with our curated entity type set.
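As a sketch of what that could look like with spaCy's `textcat_multilabel` component (the labels and labelled articles below are placeholders for our curated entity type set, not a real dataset):

```python
import random
import spacy
from spacy.training import Example

# Placeholder entity types and labelled articles; a real dataset would be curated by editors
LABELS = ["human", "place", "event", "creative work"]
data = [
    ("Ada Lovelace was an English mathematician and writer.",
     {"human": 1.0, "place": 0.0, "event": 0.0, "creative work": 0.0}),
    ("Paris is the capital and largest city of France.",
     {"human": 0.0, "place": 1.0, "event": 0.0, "creative work": 0.0}),
    # ... many more labelled articles
]

# Simple 80/20 train/validation split
random.shuffle(data)
split = int(0.8 * len(data))
train_data, dev_data = data[:split], data[split:]

nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat_multilabel")
for label in LABELS:
    textcat.add_label(label)

train_examples = [Example.from_dict(nlp.make_doc(text), {"cats": cats}) for text, cats in train_data]
optimizer = nlp.initialize(lambda: train_examples)

for epoch in range(10):
    losses = {}
    random.shuffle(train_examples)
    for example in train_examples:
        nlp.update([example], sgd=optimizer, losses=losses)
    print(epoch, losses)

# Validation: score the held-out articles
dev_examples = [Example.from_dict(nlp.make_doc(text), {"cats": cats}) for text, cats in dev_data]
print(nlp.evaluate(dev_examples))
```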