Productionize navigation vectors
Closed, Declined · Public

Assigned To

Authored By

	Halfak
	Sep 1 2017, 2:23 PM

Description

Produce regular, ongoing, concept-level navigation vectors. Vector embeddings have become a foundational tool in information retrieval, natural language processing, and artificial intelligence. However, researchers and practitioners currently rely on vectors derived from the content of Wikipedia articles. We will make navigation-based vectors available, which are substantially higher quality for some applications. To do so, we will develop a robust data pipeline that regularly produces effective navigation vectors.

We anticipate the following stages for the project:

~~Develop a Spark process that ingests page views and translates them to page id sessions.~~ (DONE - IN CODE REVIEW)
~~Develop a job that extracts and creates the page id -> Wikidata concept graph.~~ (DONE - IN CODE REVIEW)
~~Develop a second Spark process that takes session page ids and translates them to Wikidata concepts.~~ (DONE - IN CODE REVIEW)
~~Develop a third Spark process that ensures user privacy by redacting Wikidata concepts with too few views.~~ (DONE - IN CODE REVIEW)
Develop a process that runs the word2vec algorithm on the anonymized dataset.
Conduct an analysis of parameter settings for the data pipeline underlying the word2vec algorithm.
Publish the vectors and promote them to the research and practitioner communities.
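
For readers unfamiliar with the middle stages, here is a rough, plain-Python sketch of the page id -> Wikidata concept translation and the privacy redaction step. It is not the Spark code under review. The function names, the `min_views` parameter, and the choice to drop rare concepts (rather than replace them with a placeholder token) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, List

Session = List[Hashable]


def sessions_to_concepts(
    sessions: List[List[int]], page_to_concept: Dict[int, str]
) -> List[Session]:
    """Translate each session's page ids to concept ids.

    Pages without a known concept are dropped (an assumption for this sketch).
    """
    return [
        [page_to_concept[page_id] for page_id in session if page_id in page_to_concept]
        for session in sessions
    ]


def redact_rare_concepts(sessions: List[Session], min_views: int) -> List[Session]:
    """Remove concepts viewed fewer than `min_views` times across all sessions.

    Sessions left empty after redaction are discarded.
    """
    # Count every occurrence of each concept over the whole dataset.
    view_counts = Counter(concept for session in sessions for concept in session)
    redacted = [
        [concept for concept in session if view_counts[concept] >= min_views]
        for session in sessions
    ]
    return [session for session in redacted if session]


if __name__ == "__main__":
    # Hypothetical toy data: three sessions of page ids and a partial mapping.
    page_sessions = [[1, 2, 3], [2, 3], [3, 4]]
    mapping = {1: "Q1", 2: "Q2", 3: "Q3"}
    concept_sessions = sessions_to_concepts(page_sessions, mapping)
    print(redact_rare_concepts(concept_sessions, min_views=2))
```

The redacted sessions are what the word2vec stage would consume, with each session treated as a "sentence" of concept ids.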

Details

Subject	Repo	Branch	Lines +/-
Spark job to create desktop page ids viewed and searches performed in each session.	analytics/refinery/source	nav-vectors	+3 K -8
Spark job to create session event log appears to be working.	analytics/refinery/source	master	+547 -0
Placeholder for job to create page ids viewed in each session.	analytics/refinery/source	master	+327 -0
WIP: Spark job to create page ids viewed in each session	analytics/refinery/source	nav-vectors	+1 K -0
Simplified job to create session ids and finished debugging it.	analytics/refinery/source	nav-vectors	+146 -317


Related Objects

Mentioned In: T193751: Generate fresh set of navigation vectors
T161554: Provide large disk space to WikiBrain for memory-mapped file
T158972: Spark job to produce clickstream dataset
Mentioned Here: T242825: Deal with Google Chrome User-Agent deprecation

Event Timeline

Halfak created this task. Sep 1 2017, 2:23 PM

Restricted Application added a subscriber: Aklapper. Sep 1 2017, 2:23 PM

Halfak updated the task description. Sep 1 2017, 2:23 PM

Shilad mentioned this in T158972: Spark job to produce clickstream dataset. Sep 2 2017, 11:10 AM

JAllemandou moved this task from Incoming to Radar on the Analytics board. Sep 4 2017, 3:15 PM

Tbayer subscribed. Sep 4 2017, 10:17 PM

Change 376797 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@master] Placeholder for job to create page ids viewed in each session.

https://gerrit.wikimedia.org/r/376797

gerritbot added a project: Patch-For-Review. Sep 8 2017, 9:23 PM

Change 377706 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@master] Spark job to create session event log appears to be working.

https://gerrit.wikimedia.org/r/377706

Shilad updated the task description. Sep 13 2017, 4:50 AM

Shilad added a parent task: T158972: Spark job to produce clickstream dataset.

Nuria removed a parent task: T158972: Spark job to produce clickstream dataset. Sep 13 2017, 2:46 PM

Change 381169 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] WIP: Spark job to create page ids viewed in each session

https://gerrit.wikimedia.org/r/381169

Change 381517 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] Simplified job to create session ids and finished debugging it.

https://gerrit.wikimedia.org/r/381517

Change 381517 abandoned by Shilad Sen:
Simplified job to create session ids and finished debugging it.

Reason:
These changes should be squashed with the previous commit...

https://gerrit.wikimedia.org/r/381517

Change 381169 abandoned by Shilad Sen:
WIP: Spark job to create page ids viewed in each session

Reason:
Superseded by later code review.

https://gerrit.wikimedia.org/r/381169

Change 383761 had a related patch set uploaded (by Shilad Sen; owner: Shilad Sen):
[analytics/refinery/source@nav-vectors] Spark job to create page ids viewed in each session

https://gerrit.wikimedia.org/r/383761

Shilad updated the task description. Oct 25 2017, 2:25 AM

Shilad mentioned this in T161554: Provide large disk space to WikiBrain for memory-mapped file.

Shilad updated the task description. Nov 28 2017, 3:47 AM

bmansurov mentioned this in T193751: Generate fresh set of navigation vectors. May 3 2018, 3:21 PM

@Shilad: Hi! Is this task still valid and should still be open (and its patches in Gerrit)? If yes, are you still working (or still plan to work) on this task?
If you do not plan to work on this task anymore, please remove yourself as assignee (via Add Action... → Assign / Claim in the dropdown menu) so in theory others could work on it. Thanks!

Hi @Aklapper, Thanks for asking! I think this got stuck in code review. I'm happy to step in and move it forward once folks have time to code review it.

Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:33 AM

MGerlach subscribed. Oct 23 2020, 10:34 AM

Boldly adding Data-Engineering as I'd love to know who could or should review and decide on the remaining three open patches in Gerrit.
Also adding #analytics-refinery per patch codebase (refinery).

odimitrijevic added a project: Data-Engineering-Radar. Feb 6 2022, 11:22 PM

Restricted Application removed a project: Data-Engineering. Feb 6 2022, 11:22 PM

Reasons for which I think this should be abandoned:

The code uses an old version of Spark and would need to be rewritten.
The future removal of the user-agent will affect sessionization fingerprinting, so a new analysis of correctness will be needed (https://phabricator.wikimedia.org/T242825).
This research paper has shown that the clickstream dataset already provides most of the value.

Please reopen as needed.

Change 376797 abandoned by Joal:

[analytics/refinery/source@master] Placeholder for job to create page ids viewed in each session.

Reason:

https://phabricator.wikimedia.org/T174796

https://gerrit.wikimedia.org/r/376797

Change 377706 abandoned by Joal:

[analytics/refinery/source@master] Spark job to create session event log appears to be working.

Reason:

reason in https://phabricator.wikimedia.org/T174796

https://gerrit.wikimedia.org/r/377706

Change 383761 abandoned by Ottomata:

[analytics/refinery/source@nav-vectors] Spark job to create desktop page ids viewed and searches performed in each session.

Reason:

https://gerrit.wikimedia.org/r/383761
