Productionize navigation vectors
Open, Needs Triage · Public

Description

Produce regular, ongoing, concept-level navigation vectors. Vector embeddings have become a foundational tool in information retrieval, natural language processing, and artificial intelligence. However, researchers and practitioners currently use vectors based on the content of Wikipedia articles. We will make navigation-based vectors available, which are of substantially higher quality for some applications. We will develop a robust, ongoing data pipeline to regularly produce effective navigation vectors.

We anticipate the following stages for the project:

  1. Develop a Spark process that ingests page views and translates them to page id sessions. (DONE - IN CODE REVIEW)
  2. Develop a job that extracts and creates the page id -> Wikidata concept graph. (DONE - IN CODE REVIEW)
  3. Develop a second Spark process that takes session page ids and translates them to Wikidata concepts. (DONE - IN CODE REVIEW)
  4. Develop a third Spark process that ensures user privacy by redacting Wikidata concepts with too few views. (DONE - IN CODE REVIEW)
  5. Develop a process that runs the word2vec algorithm on the anonymized dataset.
  6. Conduct an analysis of parameter settings for the data pipeline underlying the word2vec algorithm.
  7. Publish the vectors and promote them to the research and practitioner communities.
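
As a rough illustration of stages 1 through 4, the following is a minimal single-machine sketch in plain Python, not the Spark jobs under code review. The 30-minute inactivity cutoff, the `client_key` field, the `min_views` threshold, and all function names are assumptions made for the example; the task does not specify them.

```python
from collections import Counter

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity cutoff, not taken from the task


def build_sessions(page_views, gap=SESSION_GAP_SECONDS):
    """Stage 1: group (client_key, timestamp, page_id) tuples into page id sessions."""
    sessions = []
    last_seen = {}  # client_key -> (timestamp of last view, current session list)
    for client, ts, page_id in sorted(page_views, key=lambda v: (v[0], v[1])):
        prev = last_seen.get(client)
        if prev is None or ts - prev[0] > gap:
            current = []  # first view, or too long since the last one: new session
            sessions.append(current)
        else:
            current = prev[1]
        current.append(page_id)
        last_seen[client] = (ts, current)
    return sessions


def to_concepts(sessions, page_to_concept):
    """Stage 3: translate page ids to concept ids, dropping pages with no mapping."""
    return [[page_to_concept[p] for p in s if p in page_to_concept] for s in sessions]


def redact_rare(sessions, min_views):
    """Stage 4: remove concepts viewed fewer than min_views times overall."""
    counts = Counter(c for s in sessions for c in s)
    redacted = [[c for c in s if counts[c] >= min_views] for s in sessions]
    return [s for s in redacted if s]  # discard sessions left empty
```

The output of `redact_rare` is a list of concept id sequences, which is the shape of input a word2vec implementation expects (each session plays the role of a sentence). The `page_to_concept` dictionary stands in for the page id -> Wikidata concept graph from stage 2.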

Event Timeline

Halfak created this task. Sep 1 2017, 2:23 PM
Restricted Application added a subscriber: Aklapper. Sep 1 2017, 2:23 PM
Halfak updated the task description. Sep 1 2017, 2:23 PM
JAllemandou moved this task from Incoming to Radar on the Analytics board. Sep 4 2017, 3:15 PM

Change 376797 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@master] Placeholder for job to create page ids viewed in each session.

https://gerrit.wikimedia.org/r/376797

Change 377706 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@master] Spark job to create session event log appears to be working.

https://gerrit.wikimedia.org/r/377706

Change 381169 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] WIP: Spark job to create page ids viewed in each session

https://gerrit.wikimedia.org/r/381169

Change 381517 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] Simplified job to create session ids and finished debugging it.

https://gerrit.wikimedia.org/r/381517

Change 381517 abandoned by Shilad Sen:
Simplified job to create session ids and finished debugging it.

Reason:
These changes should be squashed with the previous commit...

https://gerrit.wikimedia.org/r/381517

Change 381169 abandoned by Shilad Sen:
WIP: Spark job to create page ids viewed in each session

Reason:
Superseded by later code review.

https://gerrit.wikimedia.org/r/381169

Change 383761 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] Spark job to create page ids viewed in each session

https://gerrit.wikimedia.org/r/383761

Shilad updated the task description. Nov 28 2017, 3:47 AM