
Productionize navigation vectors
Closed, Declined | Public

Description

Produce regular, ongoing, concept-level navigation vectors. Vector embeddings have become a foundational tool in information retrieval, natural language processing, and artificial intelligence. However, researchers and practitioners currently rely on vectors derived from the content of Wikipedia articles. We will make navigation-based vectors available, which are of substantially higher quality for some applications. To do so, we will develop a robust data pipeline that regularly produces effective navigation vectors.

We anticipate the following stages for the project:

  1. Develop a Spark process that ingests page views and translates them to page id sessions. (DONE - IN CODE REVIEW)
  2. Develop a job that extracts and creates the page id -> Wikidata concept graph. (DONE - IN CODE REVIEW)
  3. Develop a second Spark process that ingests session page ids and translates them to Wikidata concepts. (DONE - IN CODE REVIEW)
  4. Develop a third Spark process that ensures user privacy by redacting Wikidata concepts with too few views. (DONE - IN CODE REVIEW)
  5. Develop a process that runs the word2vec algorithm on the anonymized dataset.
  6. Conduct an analysis of parameter settings for the data pipeline underlying the word2vec algorithm.
  7. Publish the vectors and promote them to the research and practitioner communities.
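
As a rough illustration of stages 1, 3, and 4, here is a minimal single-machine sketch in plain Python. It is not the code in the Gerrit patches (those are Spark jobs). The 30-minute inactivity gap, the `visitor_key` field, the choice to drop rare concepts rather than replace them with a placeholder, and all function names are assumptions made for this example only.

```python
from collections import Counter, defaultdict

# Hypothetical inactivity gap (in seconds) that ends a session; not from the task.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(page_views, gap=SESSION_GAP_SECONDS):
    """Stage 1: group (visitor_key, timestamp, page_id) rows into sessions of page ids."""
    by_visitor = defaultdict(list)
    for visitor_key, timestamp, page_id in page_views:
        by_visitor[visitor_key].append((timestamp, page_id))

    sessions = []
    for views in by_visitor.values():
        views.sort()  # order each visitor's views by timestamp
        current, last_ts = [], None
        for ts, page_id in views:
            # Start a new session when the visitor was idle longer than the gap.
            if last_ts is not None and ts - last_ts > gap:
                sessions.append(current)
                current = []
            current.append(page_id)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions


def to_concepts(sessions, page_to_concept):
    """Stage 3: translate page ids to Wikidata concept ids, skipping unmapped pages."""
    return [[page_to_concept[p] for p in s if p in page_to_concept] for s in sessions]


def redact_rare(sessions, min_views):
    """Stage 4: drop concepts viewed fewer than min_views times, then drop empty sessions."""
    counts = Counter(c for s in sessions for c in s)
    redacted = [[c for c in s if counts[c] >= min_views] for s in sessions]
    return [s for s in redacted if s]


# Toy input: (visitor_key, unix_timestamp, page_id). All values are made up.
page_views = [
    ("a", 0, 1), ("a", 60, 2), ("a", 10000, 1),
    ("b", 5, 2), ("b", 50, 3),
]
page_to_concept = {1: "Q1", 2: "Q2", 3: "Q3"}  # stage 2 output, assumed given

page_sessions = sessionize(page_views)
concept_sessions = to_concepts(page_sessions, page_to_concept)
anonymized = redact_rare(concept_sessions, min_views=2)
print(anonymized)
```

Each list in `anonymized` plays the role of a "sentence" whose tokens are Wikidata concept ids, which is the input shape a word2vec implementation would expect in stage 5.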

Event Timeline

Change 376797 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@master] Placeholder for job to create page ids viewed in each session.

https://gerrit.wikimedia.org/r/376797

Change 377706 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@master] Spark job to create session event log appears to be working.

https://gerrit.wikimedia.org/r/377706

Change 381169 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] WIP: Spark job to create page ids viewed in each session

https://gerrit.wikimedia.org/r/381169

Change 381517 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] Simplified job to create session ids and finished debugging it.

https://gerrit.wikimedia.org/r/381517

Change 381517 abandoned by Shilad Sen:
Simplified job to create session ids and finished debugging it.

Reason:
These changes should be squashed with the previous commit...

https://gerrit.wikimedia.org/r/381517

Change 381169 abandoned by Shilad Sen:
WIP: Spark job to create page ids viewed in each session

Reason:
Superseded by later code review.

https://gerrit.wikimedia.org/r/381169

Change 383761 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] Spark job to create page ids viewed in each session

https://gerrit.wikimedia.org/r/383761

@Shilad: Hi! Is this task still valid and should still be open (and its patches in Gerrit)? If yes, are you still working (or still plan to work) on this task?
If you do not plan to work on this task anymore, please remove yourself as assignee (via Add Action...Assign / Claim in the dropdown menu) so in theory others could work on it. Thanks!

Hi @Aklapper, Thanks for asking! I think this got stuck in code review. I'm happy to step in and move it forward once folks have time to code review it.

Boldly adding Data-Engineering as I'd love to know who could or should review and decide on the remaining three open patches in Gerrit.
Also adding #analytics-refinery per patch codebase (refinery).

Reasons for which I think this should be abandoned:

  • The code uses an old version of Spark and would need to be rewritten.
  • The future removal of user-agent will impact sessionization fingerprinting, so a new analysis of correctness will be needed (https://phabricator.wikimedia.org/T242825).
  • This research paper has shown that the clickstream dataset already provides most of the value.

Please reopen as needed.

Change 376797 abandoned by Joal:

[analytics/refinery/source@master] Placeholder for job to create page ids viewed in each session.

Reason:

https://phabricator.wikimedia.org/T174796

https://gerrit.wikimedia.org/r/376797

Change 377706 abandoned by Joal:

[analytics/refinery/source@master] Spark job to create session event log appears to be working.

Reason:

reason in https://phabricator.wikimedia.org/T174796

https://gerrit.wikimedia.org/r/377706

Change 383761 abandoned by Ottomata:

[analytics/refinery/source@nav-vectors] Spark job to create desktop page ids viewed and searches performed in each session.

Reason:

https://gerrit.wikimedia.org/r/383761