
GSoC 2018 Proposal: [Wikipedia Search] Predict relevance of search results from historical clicks using a Neural Click Model
Open, Needs Triage, Public

Description

Profile Information

Name: Karan Dhingra
IRC: kdhingra307
Github: https://github.com/kdhingra307
Resume: karan_dhingra.pdf
Location: New Delhi, India
Working hours: 12:00 to 21:59 (UTC+5:30)

Synopsis

Click models are algorithmic approaches that estimate the relevance of documents for a given query by modeling how users interact with search results. Currently, Wikimedia Search uses a Dynamic Bayesian Network (DBN), which is based on a probabilistic graphical model. An alternative model, the Neural Click Model (NCM), has been proposed, which is not only more accurate than the DBN but also provides a way to feed in semantic features in addition to click data. This project is about implementing, testing and analyzing the NCM to verify whether it provides any computational or prediction benefits over the current model, and finally integrating it with the Mjolnir library.
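To make the idea concrete, below is a minimal sketch of the general shape of a neural click model: a recurrent unit reads the results of a SERP in order and predicts a click probability for every position. This is an illustrative assumption, not the project's final architecture; the layer sizes, SERP_SIZE and N_FEATURES are placeholders, and Keras is used only as an example library.

```python
import numpy as np
from tensorflow.keras import layers, models

SERP_SIZE, N_FEATURES = 10, 32  # placeholder dimensions

# One SERP per sample: SERP_SIZE results, each described by N_FEATURES values.
inputs = layers.Input(shape=(SERP_SIZE, N_FEATURES))
# A sequential (GRU) unit walks over the result positions in order.
hidden = layers.GRU(64, return_sequences=True)(inputs)
# One click probability per position.
clicks = layers.TimeDistributed(layers.Dense(1, activation='sigmoid'))(hidden)

model = models.Model(inputs, clicks)
model.compile(optimizer='adam', loss='binary_crossentropy')

# Toy data: random features and observed clicks (0/1) per position.
X = np.random.rand(256, SERP_SIZE, N_FEATURES)
y = np.random.randint(0, 2, size=(256, SERP_SIZE, 1)).astype('float32')
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```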

  • Mentors: @EBernhardson and @TJones
  • I have been discussing the approaches and methodologies to be followed with the mentors and the community over Phabricator, and have provided several patches through external links.

Timeline

  • 23rd April - 14th May: Revising, fixing and refactoring existing code.
  • 14th May - 21st May: Providing support to create and access 3D sparse matrices.
  • 21st May - 28th May: Development of the core architecture.
  • 28th May - 11th June: Implementing the input normalization function and tuning hyperparameters.
  • 11th June - 15th June: Testing on machine learning ranking models (NDCG).
  • 15th June - 21st June: Integration of the architecture into the Mjolnir library.
  • 21st June - 30th June: A/B testing on already available data for the majority of test parameters.
  • 30th June - 14th July: Normalization functions for mapping probabilistic outputs to labels and for the input search query.
  • 14th July - 21st July: Retesting the whole model over single and multiple wikis.
  • 21st July - 30th July: Dynamicity over SERP_SIZE.
  • 30th July - 5th August: Wiki, test cases and documentation (where still needed) for the whole code base.
  • 5th August - 1st September: Retesting the whole model to analyze the performance improvements by varying SERP_SIZE.
  • Providing support to create and access 3D sparse matrices: The model requires its input labels to be three-dimensional, but scipy does not support 3D sparse matrices. The data therefore has to be stored in 2D, converted from sparse to dense batch by batch, and finally reshaped to 3D. The task is to make this implicit so that the learning engine can access the input matrices directly without any manual transformation (see the first sketch after this list).
  • Implementing the input normalization function and tuning hyperparameters: The approach would be to start tuning hyperparameters using grid search and to work on the normalization function while the grid search runs in the background, to save time. Normalization is required because the click counts generated while creating the training data vary with the length of the collection window (see the second sketch after this list).
  • Integration of the architecture into the Mjolnir library: The major work here would be writing a wrapper that converts the input data from SQL into the 3D sparse format, and integrating it with the data_pipeline.
  • A/B testing on already available data for the majority of test parameters: A/B testing was already done last year on the DBN model, so if that data is available we can compare the proposed model and the DBN directly for most of the test cases.
  • Dynamicity over SERP_SIZE: SERP_SIZE is the number of search results shown per page. It is deeply interlinked with the shape of the input data, which is why I have placed this task last in the timeline. Making SERP_SIZE dynamic would give us a more general model and also let us predict the optimal SERP_SIZE. It would require remodeling the input data and the way the architecture feeds the sequential units (see the third sketch after this list).
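First, a minimal sketch of the batch-by-batch densification described in the 3D sparse matrices bullet. The flattened storage layout, SERP_SIZE and N_FEATURES are illustrative assumptions; only the 2D-sparse-to-3D-dense pattern itself is the point.

```python
import numpy as np
from scipy import sparse

SERP_SIZE = 10    # results per page (assumed)
N_FEATURES = 32   # values per result (assumed)

def batch_iter(X_2d, batch_size):
    """Yield dense 3D batches from a 2D sparse matrix, one batch at a time.

    X_2d has shape (n_sessions, SERP_SIZE * N_FEATURES); each batch is
    densified and reshaped to (batch, SERP_SIZE, N_FEATURES).
    """
    n_rows = X_2d.shape[0]
    for start in range(0, n_rows, batch_size):
        batch = X_2d[start:start + batch_size]          # still sparse, cheap row slice
        dense = batch.toarray()                         # densify only this batch
        yield dense.reshape(-1, SERP_SIZE, N_FEATURES)  # back to 3D for the model

# Usage sketch with random data:
X = sparse.random(1000, SERP_SIZE * N_FEATURES, density=0.01, format='csr')
for batch in batch_iter(X, batch_size=128):
    pass  # feed `batch` (shape: [<=128, SERP_SIZE, N_FEATURES]) to the learning engine
```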
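Second, for the normalization bullet, one possible scheme (an assumption, not the final design) is to turn raw click counts, which grow with the length of the collection window, into per-result click-through rates so that data collected over different time spans stays comparable.

```python
import numpy as np

def normalize_clicks(click_counts, impressions):
    """Convert raw click counts into click-through rates in [0, 1].

    click_counts, impressions: arrays of shape (n_queries, SERP_SIZE).
    """
    impressions = np.maximum(impressions, 1)  # avoid division by zero
    return click_counts / impressions

# Two queries collected over windows of different lengths:
clicks = np.array([[30, 5, 1], [120, 60, 10]], dtype=float)
shows = np.array([[100, 100, 100], [400, 400, 400]], dtype=float)
print(normalize_clicks(clicks, shows))  # rates are comparable across windows
```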
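Third, for the SERP_SIZE bullet, a hypothetical way to feed variable-length result pages to the sequential units is to pad each SERP to the longest one in the batch and keep a mask marking the real positions. The helper below is illustrative only.

```python
import numpy as np

def pad_serps(serps):
    """Pad SERPs of different lengths into one dense batch plus a validity mask.

    serps: list of (serp_len, n_features) arrays, serp_len may differ per SERP.
    """
    max_len = max(s.shape[0] for s in serps)
    n_features = serps[0].shape[1]
    batch = np.zeros((len(serps), max_len, n_features), dtype=np.float32)
    mask = np.zeros((len(serps), max_len), dtype=bool)
    for i, s in enumerate(serps):
        batch[i, :s.shape[0]] = s   # copy the real results
        mask[i, :s.shape[0]] = True  # mark which positions are real
    return batch, mask

a = np.random.rand(10, 4)   # a SERP with 10 results
b = np.random.rand(7, 4)    # a SERP with 7 results
batch, mask = pad_serps([a, b])
print(batch.shape, mask.sum(axis=1))  # (2, 10, 4) [10 7]
```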

Participation

Since the original Mjolnir repository is present on both GitHub and Gerrit, I can work on either platform. I would prefer to fork the repository on GitHub and merge my changes whenever my mentors approve them. I can also maintain a pull request or add the mentors to the forked repository so that they can give me feedback and review the code.
Communication is a key aspect of the success of any project. I can use whichever platform the mentors prefer, as I am available on most of them. For community bonding, I would use Phabricator to discuss with other members.
I would follow proper documentation style and coding standards (to be discussed with the mentors during the community bonding period).

About Me

I am a final-year undergraduate at the University of Delhi, studying "Information Technology and Mathematical Innovation", with majors in mathematics and computer science. I was aware of GSoC last year but had prior commitments; this year I will be free from the first week of May, which allows me to devote 100% of my time to this project. While applying to GSoC and choosing which project to work on, my focus was to work in my current domain of specialization (Human-Computer Interaction). In the time I have been working on this proposal, I have been drawn to the community and will be applying for this project.
During my last internship, I worked on a similar kind of project: understanding user behavior from keystroke dynamics using time-series analysis alone, and I was very surprised by the results I got. With this project, I would get the opportunity to see the impact of similar models at a massive scale.

Experiences

Keystroke Dynamics Using Computational Intelligence Methods

The main purpose of this project was to understand user behavior and use it as a secondary measure of authentication; keystroke dynamics are not very popular as an authentication measure because of accuracy and privacy concerns. An architecture based on time-series analysis of keystrokes and semantic analysis of the user's input was designed.

My Contributions:

  • Designing the architecture for input of dynamic sequence length using GRU cells.
  • Semantic analysis using Multilayer Perceptron with a mathematical normalizer for ranking the vocabulary.

Emotion Recognition from Speech
This project involves analyzing speech using MFCCs and deep neural networks to predict emotion at the segment level.

My Contributions:

  • Developed a model to generate an MFCC feature set from raw audio, which is then fed to a deep neural network for training.
  • Designed a feedback loop for the automatic tuning of hyperparameters.

Contributions

I have recently started contributing to open source and am working on handling sparse matrices with large indices in scikit-learn, though it is not a major PR. I was introduced to Wikimedia and Phabricator recently through this project and have created several patches.

All of the external patches have been shared in this thread.

Event Timeline


Thanks, @EBernhardson, @TJones, @srishakatux and other fellow community members for accepting my proposal.

ping @EBernhardson @TJones @Kdhingra2210 Is there anything remaining in this task from GSoC'18? If not, then please consider marking it as resolved! Also, I want to check about the two patches that are still open, seem related to this project, and are waiting to be reviewed/merged. Are you considering doing follow-up work on these patches?

@srishakatux the model is complete but it is being tested over the complete data, so those patches and the project should be finished once the testing is over. It should take more than a week to finish those tests, and I am currently working on those patches.

@Kdhingra2210: Any news to share about the status of this task? Thanks in advance!

Kdhingra2210 added a comment. (Edited) Nov 5 2018, 2:35 PM

> @Kdhingra2210: Any news to share about the status of this task? Thanks in advance!

Hi @Aklapper, it's done from my side, just waiting for integration in the mjolnir pipeline.