Click models are algorithmic approaches that help in understanding the relevance of documents for a given query by modeling how users interact with search results. Currently, Wikimedia Search uses a Dynamic Bayesian Network (DBN), which is a probabilistic graphical model. A Neural Click Model (NCM) has been proposed that is not only more accurate than DBN but also allows semantic features to be used as input alongside click data. This project is about implementing, testing, and analyzing NCM to verify whether it provides any computational or prediction benefits over the current model, and finally integrating it into the Mjolnir library.
- Mentors: @EBernhardson and @TJones
- I have been discussing the approaches and methodologies to be followed with the mentors and the community on Phabricator, and have provided several patches through the external links.
| Period | Task |
|---|---|
| 23rd April - 14th May | Revising, fixing and refactoring existing code. |
| 14th May - 21st May | Providing support to create and access 3D sparse matrices. |
| 21st May - 28th May | Development of core architecture. |
| 28th May - 11th June | Implementing the input normalization function and tuning hyperparameters. |
| 11th June - 15th June | Testing on machine-learning ranking models (NDCG). |
| 15th June - 21st June | Integration of the architecture into the Mjolnir library. |
| 21st June - 30th June | A/B testing on already available data for the majority of test parameters. |
| 30th June - 14th July | Normalization functions for probabilistic outputs to labels and for the input search query. |
| 14th July - 21st July | Retesting the whole model over single and multiple wikis. |
| 21st July - 30th July | Dynamicity over SERP_SIZE. |
| 30th July - 5th August | Wiki, test cases and documentation (if remaining) for the whole code base. |
| 5th August - 1st September | Retesting the whole model to analyze the performance improvements of varying SERP_SIZE. |
- Providing support to create and access 3D sparse matrices: This model requires input labels in three dimensions, but scipy does not support 3D sparse matrices. The data therefore has to be stored in 2D, converted from sparse to dense batch by batch, and finally reshaped to 3D. The task would be to make this implicit, so that the learning engine can access input matrices directly without any manual transformation.
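As a minimal sketch of this idea (the `Sparse3D` class name and its interface are my own, not part of Mjolnir), a 3D tensor can be flattened into a 2D scipy CSR matrix and densified one batch at a time on access:

```python
import numpy as np
from scipy import sparse

class Sparse3D:
    """Store a (n, rows, cols) tensor as a 2D CSR matrix of shape
    (n, rows * cols); densify and reshape only one batch at a time."""

    def __init__(self, dense_3d):
        n, self.rows, self.cols = dense_3d.shape
        # scipy only supports 2D sparse matrices, so flatten the
        # last two dimensions before storing.
        self.mat = sparse.csr_matrix(dense_3d.reshape(n, self.rows * self.cols))

    def batch(self, start, stop):
        # Densify just the requested slice, then restore the 3rd dimension.
        block = self.mat[start:stop].toarray()
        return block.reshape(-1, self.rows, self.cols)

# Mostly-zero click labels: 4 sessions, 10 results, 3 label channels.
x = np.zeros((4, 10, 3))
x[0, 2, 1] = 1.0
m = Sparse3D(x)
```

The learning engine would then call `m.batch(start, stop)` in its input loop and never see the 2D storage format.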
- Implementing the input normalization function and tuning hyperparameters: The approach would be to start tuning hyperparameters using grid search, and to work on the normalization function while the grid search runs in the background, to save time. Normalization is required because the click counts generated while creating the training data vary with the length of the collection window.
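To illustrate why normalization is needed (the function below is a hypothetical sketch, not the final design): raw click counts scale with the length of the collection window, but click-through rates do not, so dividing by impressions makes data from different windows comparable.

```python
import numpy as np

def normalize_clicks(click_counts, impressions):
    """Turn raw click counts into click-through rates so that models
    trained on windows of different lengths see comparable inputs."""
    impressions = np.maximum(impressions, 1)  # avoid division by zero
    return click_counts / impressions

# A week of data has ~7x the counts of a day, but the same rates.
day = normalize_clicks(np.array([3.0, 0.0, 1.0]), np.array([10.0, 10.0, 10.0]))
week = normalize_clicks(np.array([21.0, 0.0, 7.0]), np.array([70.0, 70.0, 70.0]))
```

The actual normalizer would be chosen during the project, but any candidate should satisfy this invariance to window length.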
- Integration of the architecture into the Mjolnir library: In this task, the major work would be to write a wrapper that converts input data from SQL to the 3D sparse representation, and to integrate it with the data_pipeline.
- A/B testing on already available data for the majority of test parameters: A/B testing was already performed on the DBN model last year, so if that data is available we can compare the proposed model and DBN directly for most of the test cases.
- Dynamicity over SERP_SIZE: SERP_SIZE is the number of search results shown per page. It is deeply interlinked with the shape of the input data, which is why I have placed this task last in the timeline. Making SERP_SIZE dynamic would give us a generalized model and would also let us predict the optimal SERP_SIZE. It would require remodeling the input data and the way the architecture feeds the sequential units.
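One common way to feed variable-length result pages to sequential units is padding plus masking. The sketch below (`pad_serps` is a hypothetical helper of my own, shown only to make the remodeling concrete) pads each SERP's click sequence to a rectangular batch and returns a mask so padded positions can be ignored:

```python
import numpy as np

def pad_serps(serps, max_len=None):
    """Pad variable-length click sequences (one list per SERP) into a
    rectangular batch, plus a boolean mask marking real positions so
    the sequential units can skip the padding."""
    if max_len is None:
        max_len = max(len(s) for s in serps)
    batch = np.zeros((len(serps), max_len))
    mask = np.zeros((len(serps), max_len), dtype=bool)
    for i, s in enumerate(serps):
        batch[i, :len(s)] = s
        mask[i, :len(s)] = True
    return batch, mask

# Two SERPs of different sizes: 3 results and 2 results.
batch, mask = pad_serps([[1, 0, 0], [0, 1]])
```

With such a mask, the same trained model could score pages of any size, which is what makes predicting an optimal SERP_SIZE possible.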
Since the original Mjolnir is present on both GitHub and Gerrit, I can work on either platform. I would prefer to fork the repository on GitHub and merge my changes down whenever my mentors approve them. I can also maintain a PR or add the mentors to the forked repository so that they can provide insights and review the code.
Communication is a key aspect of the success of any project. I can use whichever platform the mentors prefer; I am available on most of them. During community bonding, I would use Phabricator to discuss with other members.
I will follow proper documentation style and coding standards (to be discussed with the mentors during the community bonding period).
I am a final-year undergraduate at the University of Delhi, studying "Information Technology and Mathematical Innovation", with majors in mathematics and computer science. I was aware of GSoC last year but had prior commitments; this year I will be free from the first week of May, which allows me to devote 100% of my time to this project. While applying to GSoC and choosing which project to work on, my focus was to work in my current domain of specialization (Human-Computer Interaction). Since I started working on this project I have been drawn to the community, and I will be applying for this project.
During my last internship, I worked on a similar kind of project: understanding user behavior from keystroke dynamics using time series analysis alone, and I was very surprised by the results. This project would give me an opportunity to see the impact of similar models at a massive scale.
Keystrokes Dynamics Using Computational Intelligent Methods
The main purpose of this project was to understand user behavior and use it as a secondary measure of authentication. Keystrokes are not very popular as an authentication measure because of accuracy and privacy concerns. An architecture based on time series analysis of keystrokes and semantic analysis of the user's input was designed.
- Designing the architecture for input of dynamic sequence length using GRU cells.
- Semantic analysis using Multilayer Perceptron with a mathematical normalizer for ranking the vocabulary.
Emotion Recognition from Speech
This project involves analyzing speech using MFCCs and deep neural networks to predict emotion at the segment level.
- Developed a model to generate a feature set of MFCCs from raw audio, which is then fed into a deep neural network for training.
- Designed a feedback loop for automatic tuning of hyperparameters.
I have recently started contributing to open source and am working on handling sparse matrices with large indices at scikit-learn, though it is not a major PR. I was introduced to Wikimedia and Phabricator recently through this project and have created several patches.
All of the external patches have been shared in this thread.