Change Details

Produce regular, ongoing, concept-level [navigation vectors](https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors). Vector embeddings have become a foundational tool in information retrieval, natural language processing, and artificial intelligence. However, researchers and practitioners currently rely on vectors derived from the content of Wikipedia articles. We will make navigation-based vectors available, which are of substantially higher quality for some applications. To do so, we will develop a robust, ongoing data pipeline that regularly produces effective navigation vectors.

We anticipate the following stages for the project:

1. Develop a Spark process that ingests page views and translates them to page id sessions.
2. Develop a job that extracts and creates the page id -> Wikidata concept graph.
3. Develop a second Spark process that takes the page id sessions and translates them to Wikidata concepts.
4. Develop a third Spark process that ensures user privacy by redacting Wikidata concepts with too few views.
5. Develop a process that runs the word2vec algorithm on the anonymized dataset.
6. Conduct an analysis of parameter settings for the data pipeline and the underlying word2vec algorithm.
7. Publish the vectors and promote them to the research and practitioner communities.
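The translation and redaction stages can be illustrated with a minimal sketch in plain Python. This is not the planned Spark implementation; it only shows the intended data flow on in-memory lists. All names here (`to_concept_sessions`, `redact_rare_concepts`, `MIN_VIEWS`) are hypothetical, the threshold value is illustrative rather than a project setting, and dropping sessions left with fewer than two concepts is an assumption made for the example.

```python
from collections import Counter
from typing import Dict, List

# Illustrative threshold only; the real privacy cutoff is not specified in this task.
MIN_VIEWS = 2


def to_concept_sessions(
    sessions: List[List[int]], page_to_concept: Dict[int, str]
) -> List[List[str]]:
    """Translate each session of page ids into Wikidata concept ids.

    Page ids with no entry in the page id -> concept mapping are dropped.
    """
    return [
        [page_to_concept[page_id] for page_id in session if page_id in page_to_concept]
        for session in sessions
    ]


def redact_rare_concepts(
    sessions: List[List[str]], min_views: int = MIN_VIEWS
) -> List[List[str]]:
    """Remove concepts viewed fewer than `min_views` times across all sessions.

    Assumption: sessions left with fewer than two concepts are dropped,
    since they carry no co-navigation signal.
    """
    views = Counter(concept for session in sessions for concept in session)
    redacted = [
        [concept for concept in session if views[concept] >= min_views]
        for session in sessions
    ]
    return [session for session in redacted if len(session) > 1]


# Toy data: pages 10 and 13 map to the same concept; page 99 has no mapping.
page_to_concept = {10: "Q1", 11: "Q2", 12: "Q3", 13: "Q1"}
page_sessions = [[10, 11, 99], [13, 11], [12, 10]]

concept_sessions = to_concept_sessions(page_sessions, page_to_concept)
anonymized = redact_rare_concepts(concept_sessions)
```

Each surviving session is a list of concept ids, which is the one-sequence-per-session shape that word2vec-style trainers generally take as input.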