Produce regular, ongoing, concept-level navigation vectors. Vector embeddings have become a foundational tool in information retrieval, natural language processing, and artificial intelligence. However, researchers and practitioners currently rely on vectors derived from the content of Wikipedia articles. We will make navigation-based vectors available, which are of substantially higher quality for some applications. To do so, we will develop a robust, ongoing data pipeline that regularly produces effective navigation vectors.
We anticipate the following stages for the project:
- Develop a Spark process that ingests page views and translates them to page id sessions. (DONE - IN CODE REVIEW)
- Develop a job that extracts and creates the page id -> Wikidata concept graph. (DONE - IN CODE REVIEW)
- Develop a second Spark process that takes session page ids and translates them to Wikidata concepts. (DONE - IN CODE REVIEW)
- Develop a third Spark process that ensures user privacy by redacting Wikidata concepts with too few views. (DONE - IN CODE REVIEW)
- Develop a process that runs the word2vec algorithm on the anonymized dataset.
- Conduct an analysis of parameter settings for the data pipeline and the underlying word2vec algorithm.
- Publish the vectors and promote them to the research and practitioner communities.
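
The first stages above (sessionizing page views, mapping page ids to Wikidata concepts, and redacting rarely viewed concepts) can be illustrated with a minimal single-machine sketch. This is not the project's Spark code: the `(client, timestamp, page_id)` tuple shape, the function names, the 30-minute inactivity gap, and the minimum-view threshold are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Assumed values for illustration only; the real pipeline's settings may differ.
SESSION_GAP_SECONDS = 30 * 60   # inactivity gap that closes a session
MIN_VIEWS = 2                   # concepts seen fewer times than this are redacted


def sessionize(page_views, gap=SESSION_GAP_SECONDS):
    """Group (client, timestamp, page_id) views into per-client page id sessions."""
    by_client = defaultdict(list)
    for client, ts, page_id in page_views:
        by_client[client].append((ts, page_id))

    sessions = []
    for views in by_client.values():
        views.sort()  # order each client's views by timestamp
        current, last_ts = [], None
        for ts, page_id in views:
            # Start a new session when the client was idle longer than the gap.
            if last_ts is not None and ts - last_ts > gap:
                sessions.append(current)
                current = []
            current.append(page_id)
            last_ts = ts
        sessions.append(current)
    return sessions


def to_concepts(sessions, page_to_concept):
    """Translate page ids to Wikidata concept ids, dropping unmapped pages."""
    return [[page_to_concept[p] for p in s if p in page_to_concept]
            for s in sessions]


def redact_rare(sessions, min_views=MIN_VIEWS):
    """Remove concepts with fewer than min_views total views, then empty sessions."""
    counts = Counter(c for s in sessions for c in s)
    redacted = [[c for c in s if counts[c] >= min_views] for s in sessions]
    return [s for s in redacted if s]


# Hypothetical toy data: two clients, three pages.
page_views = [("a", 0, 1), ("a", 60, 2), ("a", 4000, 1), ("b", 10, 2), ("b", 20, 3)]
page_to_concept = {1: "Q1", 2: "Q2", 3: "Q3"}

page_sessions = sessionize(page_views)
concept_sessions = to_concepts(page_sessions, page_to_concept)
anonymized = redact_rare(concept_sessions)
```

In this sketch, each list in `anonymized` plays the role of a sentence and each concept id the role of a word, which is the input shape a word2vec implementation expects in the later stages.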