Profile Information
Name: Ansh Abhay Balde
IRC nickname: anshb or anshb_ (I am available on Freenode during my working hours)
GitHub Profile: ansh103
Email ID: anshabhaybalde@gmail.com
Web Profile: https://laymansguide.github.io
Location: Odisha, India
Typical working hours: Monday-Saturday 1000-1700 IST (UTC+05:30), with 'Weekly Blogs' written on Sunday. Additionally, I will be free from 2200-0000 IST (UTC+05:30) for communication with mentors and subsequent implementation in the code. If extra work or an emergency arises and work is not finished on the scheduled days, I will work on Sundays as well.
Synopsis
- Currently, Wikimedia’s MLR pipeline runs on DBN (Dynamic Bayesian Network), a continuation of traditional PGM (Probabilistic Graphical Model) based click models. NCM (Neural Click Model) was proposed two years ago in a research paper titled 'A Neural Click Model for Web Search'. According to the paper, it yields a significant improvement on click prediction and relevance prediction tasks (NDCG@1 and NDCG@3). Wikimedia has all the relevant data to train this model and test it for generating labels for the MLR pipeline. This project aims to test NCM and find out whether it improves on DBN (the current pipeline) as the research paper concludes.
- Possible Mentors: @EBernhardson and @TJones
- I am in constant touch with them on Phabricator and IRC. Through discussion, I was able to properly understand some of the complex elements of the attached patch and also learned about many of the difficulties that may arise. The thread can be found at T186742.
Project Analysis
- Two Approaches of Click Models:
- PGM (Probabilistic Graphical Model): User behavior is represented as a sequence of observable and hidden events such as clicks, skips, and document examinations. To understand this better, imagine each event as a node in a directed graph, with its dependencies expressed as directed edges. In PGM-based click models, the structure of these dependencies must be set manually. Different models (UBM, DBN, CM, etc.) use different criteria, but all of them essentially miss a key element, the richness of user behavior, which motivates a different approach.
- DR (Distributed Representation) approach: Here user behavior is represented as vector states that capture the user’s information need and the information consumed during the session. This allows the model to capture more complex patterns than the binary events of PGMs. NCM is based on this approach: it learns concepts similar to those used in traditional click models, and it also learns concepts that cannot be designed manually.
- Neural Networks: Artificial Neural Networks form the basis of many modern systems, from image recognition to speech recognition. The basic idea is to emulate the behavior of our brain. The fact that we can read a deliberately scrambled sentence like “Teh wya we aer abel to raed thsi” shows how capable our brains are, thanks to the rigorous training our neurons have undergone. The same goes for ANNs. Increase the complexity of the network and you get DNNs; CNNs, RNNs, and LSTMs are widely used varieties. This matters to this project because we’ll be using an LSTM to train on the click data we generate.
- MLR Pipeline on Wikimedia: The high-level overview of the pipeline is to assemble search queries with visited pages to generate labels used in a machine learning algorithm. The process is divided into several stages: generating query clicks, building training data, grouping queries, sampling, generating labels, generating feature vectors, evaluating feature importance, and deploying the model to production.
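Since the proposal leans on an LSTM to model click sequences, here is a minimal NumPy sketch of a single LSTM step, just to make the recurrence concrete. The gate layout, weight names, and toy dimensions are my own illustration, not taken from the attached patch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4h, d), U: (4h, h), b: (4h,).
    Gate order (a common convention): input, forget, output, candidate."""
    z = W @ x + U @ h_prev + b
    n = h_prev.shape[0]
    i = sigmoid(z[0:n])          # input gate
    f = sigmoid(z[n:2*n])        # forget gate
    o = sigmoid(z[2*n:3*n])      # output gate
    g = np.tanh(z[3*n:4*n])      # candidate cell state
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c

# Toy dimensions: input size 3 (e.g. a click/skip/query feature), hidden size 2.
rng = np.random.default_rng(0)
d, hsz = 3, 2
W = rng.normal(size=(4 * hsz, d))
U = rng.normal(size=(4 * hsz, hsz))
b = np.zeros(4 * hsz)
h, c = np.zeros(hsz), np.zeros(hsz)
for x in np.eye(d):              # feed a tiny one-hot "session" sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (2,)
```

In the actual project, a library implementation (Keras, PyTorch, or TensorFlow) would of course be used instead of hand-rolled steps; this sketch only shows what the recurrent state update computes.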
Project Breakdown
The timeline on the subsequent pages takes into account a general approach to the model implementation, its evaluation, and testing against existing models. The associated patch implements the model using Keras. We can continue using the same library for the sake of simplicity and work out solutions around it, or we can choose a different library altogether.
Let’s analyze each one of them:-
- TensorFlow: This widely popular library by Google has a lot of potential and is certainly a good choice when you need more under-the-hood operations and want full control over your model. To be honest, it’s unfair to compare Keras with TensorFlow: Keras is a higher-level abstraction for deep learning, whereas TensorFlow is lower-level. Keras was used initially in the patch for the same reason (easier implementation and powerful models).
- PyTorch: Right now, TensorFlow has limited support for dynamic inputs compared to PyTorch, and PyTorch’s intuitiveness makes it stand out. PyTorch is also a lower-level abstraction, but it can be used as a framework too.
What should we use?
TensorFlow has the advantages of visualization through TensorBoard, large-scale distributed training (though PyTorch has started supporting that now), and deploying models to production.
This GSoC project is about testing the potential of the newer NCM model, not about deploying a model at full scale (where TensorFlow has an edge). Simply put, PyTorch is made for research, and even for production when the functional requirements are modest. It provides a better debugging and development experience, with tools like pdb and ipdb to help. And last but not least, it is Python at its core. This blog puts my points in better perspective.
There are still many other factors to consider. It’s a choice that can be made either in April or during the community bonding period. Either way, the timeline accounts for the fact that the model needs to be coded again (in Keras, PyTorch, or TensorFlow).
Timeline
All tasks fall into three broad categories, and I'll tag each timeline entry with the keyword for its category:
- Coding(C)
- Evaluation(E)
- Research(R)
| Time Period | Task | Description |
|---|---|---|
| Till April 22 | Pre-GSoC Period | Working on some of the code cleanup tasks associated with the patch as entry tasks (C & R) and running preliminary tests to get familiar with the codebase (E) |
| April 23 to May 13 | Community Bonding Period | Reviewing the discussion from the pre-GSoC period (R) and making the relevant code changes (C). Getting familiar with the existing DBN model and the current MLR model, reviewing the research paper thoroughly, and laying the groundwork for the coding period (R) |
| May 14 to May 20 | Preparing the input data (C) and researching training set sizes (R) | The data is in the form of click logs and needs to be processed into feature vectors. (This is already part of the patch in the Phabricator thread, so it won’t take much time; I will therefore start researching how size variation affects the evaluation of some common models.) |
| May 21 to May 27 | Evaluating the training results for various sizes (C & E) and researching hyperparameters (R & C) | Each observation we have covers a single search session, so we can generate sets of 10s/100s from millions of observations. Since training on a very large dataset would be an issue, I will try several moderately sized sets. Hyperparameter tuning is the essence of any machine learning procedure; I will read the relevant material on the various parameters and lay down the code for it. |
| May 28 to June 3 | Hyperparameter tuning (E) and researching normalization procedures (R) | Evaluating results while varying the hyperparameters studied the previous week (network size/shape, number and width of layers, activation functions, etc.). Deciding which normalization procedures to explore, since new datasets will differ in size from the training dataset and would otherwise perform poorly, and coding the relevant procedures. |
| June 4 to June 10 | Normalization (C & E) | Employing the normalization technique finalized the previous week and evaluating the results on new datasets. |
| June 11 to June 15 | Phase 1 Evaluation | Code Cleanup and Final Submissions. |
| June 16 to June 17 | Kickoff Phase 2 (R) | Writing a detailed blog on what went right/wrong in Phase 1 and laying the groundwork for Phase 2. |
| June 18 to June 24 | Train MLR models (R & C) | Revisiting the procedures discussed during the community bonding period about the current MLR model; this sums up the first assessment of learning-to-rank. Training MLR models using labels from the neural click model. |
| June 25 to July 8 | Evaluation against DBN (R, C & E) | There are several options. Ruling out crowdsourced labeling platforms, there is the survey that ran for all q+d pairs in the Discernatron data, after which a NN was trained to predict the labels assigned by human judges from the survey data. Following that, we'll run a survey with queries we don't have labels for and use the NN to predict the relevance labels. This can be the basis for comparing the two models. |
| July 9 to July 13 | Phase 2 Evaluation | Code Cleanup and Final Submissions. |
| July 14 to July 15 | Kickoff Final Phase (R) | Writing a detailed blog on what went right/wrong in Phase 2 and laying the groundwork for the final phase. |
| July 16 to July 29 | Running A/B tests and query normalization steps (R, C & E) | Reading through relevant Phabricator threads about the A/B tests that have already been run, then running the full training pipeline to build models and running A/B tests with users on the different models. This will be time-consuming (15+ days; it should ideally finish before August 6), but we can compare these results with the earlier step to get a better picture of the whole model. After the test, we can try different query normalization steps (such as removing spaces and stripping periods in acronyms) to group together more queries than would otherwise be possible. If these steps go through seamlessly, I will research evaluation against the wikis. |
| July 30 to August 5 | Buffer Week / Extra Task | Can be used as a buffer week in case any tasks remain, or an extra task can be taken up: evaluating NCM against the wikis. |
| August 6 to August 14 | Final Phase Submission | Submit final code work and final mentor evaluation. |
| August 22 | Final Results | After the final evaluation by mentors, the results of Google Summer of Code 2018 will be announced |
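The hyperparameter-tuning weeks in the timeline could follow a simple grid search over the parameters mentioned (layer width, depth, activation, etc.). Here is a hedged sketch of that loop; the search space and the `evaluate` function are stand-ins of my own, since the real evaluation would train the click model and report a relevance metric such as NDCG.

```python
from itertools import product

# Hypothetical search space; the real one would cover the hyperparameters
# listed in the timeline (network size/shape, activations, and so on).
grid = {
    "hidden_units": [64, 128],
    "layers": [1, 2],
    "activation": ["tanh", "relu"],
}

def evaluate(params):
    """Stand-in for train + validate. In the real project this would fit
    the click model with these params and return a validation score."""
    return params["hidden_units"] / 128 + params["layers"] * 0.1

# Enumerate every combination and keep the best-scoring configuration.
best = max(
    (dict(zip(grid, vals)) for vals in product(*grid.values())),
    key=evaluate,
)
print(best["hidden_units"], best["layers"])  # 128 2
```

For larger search spaces, random search or an early-stopping scheme would likely replace the exhaustive product, but the structure of the loop stays the same.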
Deliverables
- Working implementation of NCM model
- Comparison of results with the DBN model using the A/B test or the survey option mentioned in the timeline
- Extra goal(s): Application of the model on some of the wikis.
Participation
Daily progress will be updated on the project thread on Phabricator so that mentors can direct me accordingly. I'll also be available for communication through Freenode and email. I am comfortable with both Gerrit and GitHub.
A weekly blog will be published every Sunday on my GitHub page (link above) about my progress. There will also be a detailed analysis blog at the end of each phase (1, 2, and final) covering achievements with respect to the goals set at the start of that phase.
About me
- Education(in progress): BTech Dual Degree in Electrical Engineering(3rd Year)
- About the Program: I heard about GSoC this year while searching the web for reputable open source programs. Wikimedia's participation in this prestigious event for twelve straight years shows how important open-source solutions are to the Wikimedia movement. Add to that the fact that this organization had a project matching my own interests: neural networks.
- Any other commitments during the program: From April 23 to April 30 I will have my university exams. From April 30 onward I'll be committed to GSoC as a full-time job. My university tentatively reopens in the last week of the GSoC coding period (i.e. the start of August), so my working hours would change to 1700-0100 IST (UTC+05:30) for that week only.
- About the Project: NCM (Neural Click Model) has been around for almost two years now. As far as I know, no one has tried it at the scale this project aims for. If the project results match the research paper’s conclusions, NCM could be implemented on Wikimedia later on. With my knowledge of Python, NNs, and Information Retrieval (with the help of the essential resources attached earlier), I hope to complete this project during Google Summer of Code.
- Future plans after GSoC: The project I am associated with has far broader implications than are currently realized, and this project is just a step towards them. After gaining relevant experience in Wikimedia's Discovery Search, I will definitely work to improve it and actively participate in discussions around it.
Past Experience
- Prerequisites for the project (and their status)
- Python (Proficient)
- Neural Networks & Keras (took a regular course at university and did a project on forecasting hourly load consumption using an LSTM. Here's the link.)
- Probability and Statistical Methods for a better understanding of research papers (was part of Mathematics-III coursework at University)
- General Understanding of Click Models (read a survey and went through the code of various PGM-based click models)
- Research Paper on NCM (read the paper and understood almost all of it; the extended discussion can be found on the Phabricator thread)
- PySpark (ongoing; I have started reading from online resources and expect to finish before the GSoC coding period)
- Mjolnir library of WikiMedia (Pending)
- The code cleanup analysis associated with the patch, covering the status of some of these items so far, can be found later in this proposal. Here is the link where the rest will be updated constantly.
- The relevant test results and analysis can be found here. These will be updated as different parameters change.
- OSS contributions: I have made a few documentation changes in some communities as starter contributions; they can be found on my GitHub profile. They are not significant to this project, so I am not linking them here. Apart from that, I have done a few NN projects for university, which I'll upload soon and link here.
- Wikimedia Contributions: This project is based on the newer NCM model, which is not yet employed in the current MLR pipeline, so there aren't existing sources to contribute to. But there is plenty to work on in the patch attached to the Phabricator thread, and I have attached the relevant analysis files above.
Code Cleanup Analysis
Some of the code cleanup tasks are associated with the ncm.py file attached as a patch to the project thread on Phabricator.
After a discussion with the project mentor, EBernhardson, it was concluded that these code cleanup tasks have lower priority on the project's agenda. At the same time, they are suitable tasks to showcase in this proposal to demonstrate familiarity with the code and the ability to work on it in the future. Some of them are general cleanups, while others are actually associated with project subtasks (e.g. evaluating activation functions). Here's an insight into each task, with code shown for some and analysis for the others.
- Support n dimensions: The purpose is to generalise the code so that it is more likely to be correct. Solution: Here's an implementation of an n-dimensional sparse matrix (we can explore this for the ncm.py patch later on).
- Use of numpy arrays: The purpose is to cut memory usage roughly in half versus Python lists. It might also allow faster copy operations from the vectors. Solution: Using the above implementation of an n-dimensional sparse matrix, one can easily convert it to a dense numpy array using todense(), or we can first create numpy arrays and build sparse matrices from them using this implementation. There's an alternate take here.
- Output a single file using np.savez: Purpose: The X and Y files generated should be stored in a single file for cleanliness. Problem: In a single .npz file there is significant data leakage between observations (observations of the same query, or results that share feature vectors due to aggregation). Solution: Currently there's no workaround that I can think of, but this is not necessary for now, so we'll leave it as it is.
- Verify that conclusions from the research paper match our data: Purpose: Although we don't care that much about click prediction, we could try to compare results from a real A/B test with those of a test that uses predicted clicks. Solution: This is one of the subtasks of the GSoC coding period; the comparison can wait until some data is gathered.
- Trying a high-core-count machine: Purpose: On EBernhardson's (mentor's) laptop (with only 2 CPUs) this is much slower. Solution: Results were obtained with workers=1, use_multiprocessing=True. Detailed tests and analysis can be found here.
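As one possible shape for the n-dimensional sparse matrix task above, here is a minimal COO-style sketch of my own: only nonzero entries are stored in a dict keyed by index tuples, and todense() materializes a numpy array. This is an illustration under my own naming, not the implementation linked in the proposal.

```python
import numpy as np

class SparseND:
    """Minimal COO-style n-dimensional sparse array: only nonzero entries
    are stored, as {index_tuple: value}."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        self.data = {}

    def __setitem__(self, idx, value):
        if value:
            self.data[idx] = value
        else:
            self.data.pop(idx, None)   # storing zero deletes the entry

    def __getitem__(self, idx):
        return self.data.get(idx, 0)   # absent entries read as zero

    def todense(self):
        """Materialize as a dense numpy array (fast copies, heavy memory)."""
        out = np.zeros(self.shape)
        for idx, value in self.data.items():
            out[idx] = value
        return out

s = SparseND((2, 3, 4))        # a 3-D example; the class works for any rank
s[0, 1, 2] = 5.0
s[1, 0, 3] = -1.0
print(s.todense().sum())       # 4.0
```

The trade-off this sketch highlights is the one the cleanup notes mention: the sparse form keeps memory proportional to the number of nonzeros, while todense() recovers plain numpy arrays for fast vectorized operations and copies.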
Any Other Info
Apart from the technical knowledge required for the project, I have a keen interest in helping others by sharing the knowledge I gain while exploring new subjects of interest. That's why I started a site named Layman's Guide for my fellow college mates and juniors. It was built within a month using Gatsby. It's still in its initial stage (it has no relevant articles yet), but I plan to update it every week in April as well as during the GSoC program. Here's the link to the website's code.