
Predict relevance of search results from historical clicks using a Neural Click Model
Closed, Declined · Public

Description

https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/borisov-neural-2016.pdf

Came across a semi-recent paper proposing generation of relevance predictions from click data using neural networks. The findings show a respectable improvement over DBN (which we use today) for ndcg@1 and ndcg@3. We have all the necessary input data to train this model; it would be nice to build it out and evaluate its performance generating labels for our MLR pipeline.

This task can be done without an NDA, although an engineer under NDA will have to prepare the data first. The prepared data would be aggregated click counts and contain no query strings. There is a patch attached to this ticket that can generate data files from the click data. This task will require a reasonable amount of compute, so it may require access to WMF analytics compute resources.

Event Timeline


Likely this would be a couple months of work to do everything right, though it might only take a few weeks to do an initial proof of concept and validate the idea before really digging into it. Sub-tasks include (but are not limited to):

  • Process the click logs into the feature vectors the NCM uses as input
  • Evaluate the training results for various sizes of input. Each observation provided is a single search session, so we have the ability to generate tens (perhaps low hundreds) of millions of observations. Training on that much data is likely prohibitive, so the effect of varied training data sizes will have to be analyzed.
  • hyperparameter search, standard DNN stuff:
    • Evaluate network size/shape. Number and width of layers almost certainly depends on the data. Needs evaluation.
    • Evaluate activation functions.
    • There are of course more hyperparameters to tune, but the above seemed the most important. Still worth looking into the rest.
  • Evaluate how we would apply this to new data without re-training. As the feature vectors are aggregated values across the dataset it would, potentially, work poorly on a new dataset of a different size than the one trained on. Likely some form of normalization can improve the situation.
  • Train MLR models using labels from the neural click model. Evaluate how the results of these models compare to models trained on DBN data

Extras that would be nice:

  • Setup and run an AB test between models trained with DBN labels and NCM labels
  • The paper defines relevance as P(C=1|q,d). We would need to convert this probability into an integer label for the MLR system it feeds into. Maybe stick with what we do for DBN (round(prob * 10)), but there might be better ways; see the sketch after this list.
  • Evaluate training NCM against all wikis in one model, or training a separate model for each wiki. Depending on how much extra data improves the model this could be a bit complicated, as while we have tons of data for enwiki, something smaller like hewiki or viwiki will have orders of magnitude less training data.
  • For the DBN we apply a query normalization step that allows more queries to be grouped together than would be naively possible. This is a pretty expensive step (taking >1hr and keeping a 1k-core elasticsearch cluster quite busy). We should evaluate whether this benefits the NCM or is just unnecessary extra work. Maybe a very simple normalization routine (drop spaces at the beginning/end, strip periods in acronyms like N.A.S.A., etc.) would be enough?
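A minimal sketch of the two ideas above (label conversion and a very simple query normalization). The label conversion mirrors the round(prob * 10) approach used for DBN output; the acronym handling is a hypothetical illustration, not an existing routine:

import re

def probability_to_label(prob, scale=10):
    # Mirror what is done for DBN output today: round(prob * 10) -> integer label.
    return int(round(prob * scale))

def simple_normalize(query):
    # Hypothetical "very simple" normalization: trim whitespace, lowercase,
    # and collapse dotted acronyms such as N.A.S.A. -> nasa.
    q = query.strip().lower()
    q = re.sub(r'\b(?:[a-z]\.){2,}', lambda m: m.group(0).replace('.', ''), q)
    return re.sub(r'\s+', ' ', q)

print(probability_to_label(0.73))                 # 7
print(simple_normalize('  N.A.S.A. launch  '))    # nasa launch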

Possible difficulties:

  • The training data is huge. The input vectors are 11x11265 (~124k values) per observation, and we have millions of observations. Thankfully the input data is very sparse; otherwise 10M observations would be ~1TB of training data (see the sketch after this list).
  • There are no GPUs for training in the wikimedia analytics cluster where the click logs live. CPU training is of course possible; not sure of the training speed difference between the two for this use case.
  • In terms of dispatchable compute, for training, our only option is currently a hadoop cluster. This probably means building on top of pyspark (or scala? but python is more popular for NN) for distributed training/hyperparameter search.
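To make the sparsity point concrete, here is a rough illustration (with made-up feature indices) of one 11x11265 observation stored densely versus as a scipy CSR matrix:

import numpy as np
from scipy import sparse

# One observation: 11 timesteps x 11265 features (~124k values). Even at one
# byte per value, 10M dense observations exceed a terabyte; stored sparse, the
# handful of non-zero entries cost almost nothing.
dense = np.zeros((11, 11265), dtype=np.float32)
dense[0, 42] = 1.0      # hypothetical query feature
dense[3, 9000] = 1.0    # hypothetical interaction feature

obs = sparse.csr_matrix(dense)
print(dense.nbytes)                                                # ~496 KB dense
print(obs.data.nbytes + obs.indices.nbytes + obs.indptr.nbytes)    # well under 1 KB sparse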

Change 409124 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[search/MjoLniR@master] [WIP] Neural Click Model

https://gerrit.wikimedia.org/r/409124

Attached patch is a small proof of concept, but going from there to a full evaluation of NCM vs DBN will be a larger amount of work that I can't find time for in the immediate future. This would be a great project for someone with an interest in applying NNs to practical problems.

@EBernhardson I might be wrong here, but considering the scope and complexity of this project, IMO this may not be a good fit for Google-Summer-of-Code (2018)

One of the things that surprised me when attending the GSoC mentor summit in 2016 was that the scope of our tasks was much more limited than that of some other orgs. For example, I think this task is fairly comparable to the Learning to Rank Clickstream Mining project at https://trac.xapian.org/wiki/GSoCProjectIdeas but is a bit more focused on a specific idea. I think we will have to be careful to only accept a student with enough background/experience to have a chance at finishing, but I don't think that needs to prevent students from applying. Overall I think this task can be succinctly described as "build the NN described in the paper and evaluate its performance against an existing relevance prediction implementation", which seems quite reasonable as a task; the actual implementation of the net is made significantly easier by available libraries like keras, leaving most of the work in the evaluation.

Yes! That's a valid point, and there is much to learn from the resource you have shared. In the past, what we have seen is that if we showcase a project which is still in the discussion phase, or is not quite feasible, or has no mentors available, then it only frustrates our potential candidates. That's why we have changed the strategy a bit: recruiting well-defined, easy-to-understand projects and mentors comes first, and promoting them comes after, so I was trying to keep that in mind. With that said, I've added this project to our ideas list: https://www.mediawiki.org/wiki/Google_Summer_of_Code/2018#Predict_relevance_of_search_results_from_historical_clicks_using_a_Neural_Click_Model. Please help correct the description if required. Also, I am wondering if you would be able to suggest two mentors (including you perhaps, but not sure if you will have time :P) who would be willing to mentor for this project if a student shows interest.

Hi @EBernhardson, I would like to contribute to this project. This is interesting work, though the computation is quite heavy and time demanding. I have previously worked on ranking properties for an entity in Wikidata by their interestingness (the interest a user has in one property over another). At the same time, I have experience working with Deep Learning models, so this would be a good opportunity for me to contribute to this project.

@Vvekbv I think first steps for this project would be to review the paper and review the aggregation performed in the patch associated with this ticket, to verify that the input vectors are generated correctly according to the paper. Once we think the data collection is correct i can run it against some time period and put a sample of data (probably saved as a scipy.sparse matrix, for loading into python) on https://analytics.wikimedia.org/datasets/discovery/ to start working with.

In terms of compute i think we can start with a relatively small set of data, get everything working and doing what we expect, then size up the data and evaluate how it performs and how things change as the data sizes increase.

EBjune triaged this task as Medium priority. Feb 22 2018, 6:23 PM

Hi @EBernhardson, I looked up this project from GSoC 2018 and am interested in it. I have gone through the research paper and would like to work on this. I have already worked on time series models.

Can you help me with an entry point for this project?

@Kdhingra2210 Great! I'm not really sure what the best entry point is. Without review of the data collection we mostly have to hope the aggregation I wrote is correct. Even so, I put up a sample of the available click data on the public archive[1]. The data is perhaps deceptively small as it compresses very well. Contained there are around 900k observations (search sessions). The X matrix has an overall shape of (899310, 11, 11265), giving ~111 billion individual values. I wrote a README file that hopefully explains the data format. The data format can also be reviewed in the gerrit link above, which contains the code used to collect these sparse matrices.

[1] https://analytics.wikimedia.org/datasets/discovery/ncm-agg-clicks/
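The README in [1] is authoritative for the on-disk layout; the following is only a hypothetical loading sketch, assuming a file holds a 2D scipy CSR matrix of shape (n_sessions * 11, 11265) that is densified a small batch at a time (the full (899310, 11, 11265) tensor would never fit in memory densely):

import numpy as np
from scipy import sparse

X2d = sparse.load_npz('ncm_agg_clicks_sample.npz')   # hypothetical file name
n_sessions = X2d.shape[0] // 11

def dense_batch(start, size):
    # Densify `size` sessions starting at session index `start`.
    rows = X2d[start * 11:(start + size) * 11]
    return np.asarray(rows.todense()).reshape(size, 11, 11265)

print(n_sessions)
print(dense_batch(0, 32).shape)    # (32, 11, 11265)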

Hi @EBernhardson
I am Ansh, a 3rd-year undergraduate student from India. Over the past week, I found my interest in Neural Networks matching the project mentioned here. I have done a “Load Forecasting Project using RNNs for my Institute’s hourly consumption of units” as part of a previous project.
While going through the research article, I found that I need to understand the concepts of Information Retrieval at a deeper level, so I briefly went through this survey[1], which sums up all the click models in a more descriptive manner while staying comprehensible. It took quite some time to review all this material, but I came to the following conclusions:

  1. Wikimedia is interested in this research article because the “neural click model” proposed here is faster on both the click prediction and relevance prediction tasks. At the same time, it adds an additional feature, i.e. it can learn concepts which can’t be designed manually.
  2. With that being said, we need to see whether that really works or not.
  3. Earlier in the project discussion, “review the aggregation performed in the patch associated with this ticket, to verify that the input vectors are generated correctly according to the paper” was set as the starting point of the project.
  4. That was not done, so it was assumed that the datasets were generated correctly, and the click data was made public with the attached Gerrit links.

Before dealing with the click data generated, I would take up the task of reviewing the patch provided. After that, I'll provide an update here.

[1] https://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf

This message is for students interested in working on this project for Google-Summer-of-Code (2018)

  • Student application deadline is March 27 16:00 UTC.
  • If you have questions about eligibility, please read the GSoC rules thoroughly here https://summerofcode.withgoogle.com/rules/. Wikimedia will not be responsible for verifying your eligibility and also not be able to make any decisions on this. For any clarifying questions, please email gsoc-support@google.com
  • Ensure that by now you have already discussed your implementation approach with your mentors, completed a few bugs/microtasks and made a plan to move forward with the proposal
  • I encourage you to start creating your proposals on Phabricator now to receive timely feedback on them from mentors. Do not wait until the last minute. Give your mentors at least a week's time to review your proposal, so that you could then incorporate any suggestions for changes. Learn how to submit a proposal in our participant's guide: https://www.mediawiki.org/wiki/Google_Summer_of_Code/Participants (Step 9)
  • Only proposals that contain links to successfully merged patches made before the application period and that are submitted on both Phabricator and the GSoC portal will be considered for the review process. So, between now and the application deadline, you could consider working on this task.
  • If you would like to chat with me more about the process or have questions, come and talk to me in the Zulip chat: https://wikimedia.zulipchat.com/

Also, I am wondering if you would be able to suggest two mentors (including you perhaps, but not sure if you will have time :P) who would be willing to mentor for this project if a student shows interest.

@EBernhardson has asked me to be the second mentor, so here I am!

Hi @EBernhardson
I am Ansh, a 3rd-year undergraduate student from India. Over the past week, I found my interest in Neural Networks matching the project mentioned here. I have done a “Load Forecasting Project using RNNs for my Institute’s hourly consumption of units” as part of a previous project.
While going through the research article, I found that I need to understand the concepts of Information Retrieval at a deeper level, so I briefly went through this survey[1], which sums up all the click models in a more descriptive manner while staying comprehensible.

Good initiative. The author of the book is also the author of the DBN implementation we used, so it is quite likely to be relevant. I looked over the introduction and it's worth noting that the book refers to click prediction / user simulation as the primary goal of a click model. That is likely true in general, but in this particular task we are more concerned with the ability of the model to predict relevance. The relevance predictions are then used as the target document orderings for an ML ranking function.

It took quite some time to review all this material, but I came to the following conclusions:

  1. Wikimedia is interested in this research article because the “neural click model” proposed here is faster on both the click prediction and relevance prediction tasks. At the same time, it adds an additional feature, i.e. it can learn concepts which can’t be designed manually.

We aren't as worried about the speed of relevance prediction; the main goal is the accuracy of that prediction. The hypothesis is basically that neural click models can generate more accurate predictions of relevance from click data than the DBN. The paper shows, for their dataset, an improvement in ndcg@1 from 0.717 to 0.775 and in ndcg@3 from 0.725 to 0.773. If even half of that improvement is seen on our own data, it could lead directly to improvements in search results. One difficulty for this task will be that the researchers had manual relevance judgements for 5k queries to calculate those ndcg@k scores, but we don't have a similar quantity of "ground truth" data. We will need to determine some method of comparing the relevance predictions.
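For reference, one common formulation of ndcg@k (exponential gain, log2 discount; the paper may use a slightly different variant):

import numpy as np

def dcg_at_k(relevances, k):
    rel = np.asarray(relevances, dtype=float)[:k]
    return float(np.sum((2 ** rel - 1) / np.log2(np.arange(2, rel.size + 2))))

def ndcg_at_k(relevances_in_ranked_order, k):
    # Relevances of documents in the order a model ranked them, normalized by
    # the ideal (descending) ordering of the same relevance labels.
    ideal = dcg_at_k(sorted(relevances_in_ranked_order, reverse=True), k)
    return dcg_at_k(relevances_in_ranked_order, k) / ideal if ideal > 0 else 0.0

print(round(ndcg_at_k([2, 3, 0, 1], 3), 3))    # ~0.79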

  2. With that being said, we need to see whether that really works or not.

Exactly.

  3. Earlier in the project discussion, “review the aggregation performed in the patch associated with this ticket, to verify that the input vectors are generated correctly according to the paper” was set as the starting point of the project.

Correct. There is a reasonable chance the aggregation i wrote generates correct sparse matrices, but code could always use review.

  4. That was not done, so it was assumed that the datasets were generated correctly, and the click data was made public with the attached Gerrit links.

Also correct. It might also be worth noting this is only a sample of the full dataset.

Before dealing with the click data generated, I would take up the task of reviewing the patch provided. After that, I'll provide an update here.

[1] https://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf

Hi @EBernhardson
I am Ansh, a 3rd-year undergraduate student from India. Over the past week, I found my interest in Neural Networks matching the project mentioned here. I have done a “Load Forecasting Project using RNNs for my Institute’s hourly consumption of units” as part of a previous project.
While going through the research article, I found that I need to understand the concepts of Information Retrieval at a deeper level, so I briefly went through this survey[1], which sums up all the click models in a more descriptive manner while staying comprehensible.

Good initiative. The author of the book is also the author of the DBN implementation we used, so it is quite likely to be relevant. I looked over the introduction and it's worth noting that the book refers to click prediction / user simulation as the primary goal of a click model. That is likely true in general, but in this particular task we are more concerned with the ability of the model to predict relevance. The relevance predictions are then used as the target document orderings for an ML ranking function.

After having read the research article thoroughly, I did see how relevance prediction (which is what we are concerned with in our project) was the second part of their conclusions, apart from the click prediction tasks.

It took quite some time to review all this material, but I came to the following conclusions:

  1. Wikimedia is interested in this research article because the “neural click model” proposed here is faster on both the click prediction and relevance prediction tasks. At the same time, it adds an additional feature, i.e. it can learn concepts which can’t be designed manually.

We aren't as worried about the speed of relevance prediction; the main goal is the accuracy of that prediction. The hypothesis is basically that neural click models can generate more accurate predictions of relevance from click data than the DBN. The paper shows, for their dataset, an improvement in ndcg@1 from 0.717 to 0.775 and in ndcg@3 from 0.725 to 0.773. If even half of that improvement is seen on our own data, it could lead directly to improvements in search results. One difficulty for this task will be that the researchers had manual relevance judgements for 5k queries to calculate those ndcg@k scores, but we don't have a similar quantity of "ground truth" data. We will need to determine some method of comparing the relevance predictions.

My bad here. Section 5.2 of the research article clearly shows how NCM performs better than other PGM-based click models on ndcg@1 and ndcg@3. They had 41,275 query-document pairs to back their relevance prediction results, but as you said we don't have that.
There are two ways to solve it (correct me if I am wrong):

  1. Like in the research paper, use human-generated binary relevance labels, such as those from Yandex's Relevance Prediction dataset. This is an expensive option; Russia's popular search engine had such resources, but here at Wikimedia we can't do that (assumption).
  2. Estimated relevance data[2]. The paper clearly shows how close the calculations are for estimated relevance data and editorial relevance data. NDCG scores are evaluated over the estimated, partial editorial, and combined datasets as well.
  2. With that being said, we need to see whether that really works or not.

Exactly.

  3. Earlier in the project discussion, “review the aggregation performed in the patch associated with this ticket, to verify that the input vectors are generated correctly according to the paper” was set as the starting point of the project.

Correct. There is a reasonable chance the aggregation i wrote generates correct sparse matrices, but code could always use review.

I am up for some of the todo tasks here. I'll write code according to this timeline; tasks with no timeline will be dealt with later:

  1. support n-dimensions 17th March
  2. Use numpy arrays and resize? Would cut memory usage vs python list by 1/2. It might also allow faster copy operations from the vectors 18th March
  3. Output a single file with all using np.savez But then we have to dig (only a little) into the sparse matrix implementation
  4. Verify conclusions from research paper match our data. Although we don't care that much about click prediction. I suppose though we could try and compare results from a real AB test with that of a test that uses predicted clicks. See if they match up?
  5. Instead of these generators and excessive error handling, what if we copy the whole directory at the outset and use a contextmanager to clean it up?
  6. Evaluate with larger datasets line
  7. mask out the first element of time series, it is only the query and has nothing to predict line
  8. On my laptop (with only 2cpu) this is much slower, worth trying on a high core count machine though. workers=1, use_multiprocessing=True
  9. Mini batches/Memory Sized Batches More training data? Activation Functions Units reduce_input_dim
  4. That was not done, so it was assumed that the datasets were generated correctly, and the click data was made public with the attached Gerrit links.

Also correct. It might also be worth noting this is only a sample of the full dataset.

Before dealing with the click data generated, I would take up the task of reviewing the patch provided. After that, I'll provide an update here.

[1] https://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf

I had some doubts though:

  1. Regarding the implementation of the n-dimensional matrix: I am not sure about the use of pyspark in the existing code, maybe because I don't know much about it. Here's what I created[3] using numpy only.
  2. Are the subtasks mentioned in the project for GSoC's whole timeline, or just for now, before the application period starts?

[2] https://zhangyuc.github.io/files/wang10cikm.pdf
[3] https://github.com/ansh103/Search_Engine_Click_Models/blob/master/sparse_n_dim.py

Hi @EBernhardson
I am Ansh, a 3rd-year undergraduate student from India. Over the past week, I found my interest in Neural Networks matching the project mentioned here. I have done a “Load Forecasting Project using RNNs for my Institute’s hourly consumption of units” as part of a previous project.
While going through the research article, I found that I need to understand the concepts of Information Retrieval at a deeper level, so I briefly went through this survey[1], which sums up all the click models in a more descriptive manner while staying comprehensible.

Good initiative. The author of the book is also the author of the DBN implementation we used, so it is quite likely to be relevant. I looked over the introduction and it's worth noting that the book refers to click prediction / user simulation as the primary goal of a click model. That is likely true in general, but in this particular task we are more concerned with the ability of the model to predict relevance. The relevance predictions are then used as the target document orderings for an ML ranking function.

After having read the research article thoroughly, I did see how relevance prediction (which is what we are concerned with in our project) was the second part of their conclusions, apart from the click prediction tasks.

It took quite some time to review all this material, but I came to the following conclusions:

  1. Wikimedia is interested in this research article because the “neural click model” proposed here is faster on both the click prediction and relevance prediction tasks. At the same time, it adds an additional feature, i.e. it can learn concepts which can’t be designed manually.

We aren't as worried about the speed of relevance prediction; the main goal is the accuracy of that prediction. The hypothesis is basically that neural click models can generate more accurate predictions of relevance from click data than the DBN. The paper shows, for their dataset, an improvement in ndcg@1 from 0.717 to 0.775 and in ndcg@3 from 0.725 to 0.773. If even half of that improvement is seen on our own data, it could lead directly to improvements in search results. One difficulty for this task will be that the researchers had manual relevance judgements for 5k queries to calculate those ndcg@k scores, but we don't have a similar quantity of "ground truth" data. We will need to determine some method of comparing the relevance predictions.

My bad here. Section 5.2 of the research article clearly shows how NCM performs better than other PGM-based click models on ndcg@1 and ndcg@3. They had 41,275 query-document pairs to back their relevance prediction results, but as you said we don't have that.
There are two ways to solve it (correct me if I am wrong):

  1. Like in the research paper, use human-generated binary relevance labels, such as those from Yandex's Relevance Prediction dataset. This is an expensive option; Russia's popular search engine had such resources, but here at Wikimedia we can't do that (assumption).
  2. Estimated relevance data[2]. The paper clearly shows how close the calculations are for estimated relevance data and editorial relevance data. NDCG scores are evaluated over the estimated, partial editorial, and combined datasets as well.

Indeed 1 is too expensive. 2 is interesting; I hadn't seen this particular paper before. After reading it I'm not sure how it solves the problem. My reading is that the primary contribution is a method to apply lambdamart directly to the probabilities of a PGM (GCM specifically, but others could be adapted). It seems more like another potential avenue to investigate for improving click modeling, to be compared against DBN and NCM.

I've been thinking about this for some time as well and have a few ideas, but not sure how many will pan out:

  • We built a crowd sourced labeling platform and set it up at https://discernatron.wmflabs.org. Query + document pairs are labeled both by people on our team and volunteers we could convince. Unfortunately the work is tedious and not directly rewarding. We labeled around 150 queries with an average of 40 documents each this way. A significant number of these queries are from the long tail though, and are unlikely to show up with any frequency in new clickstream datasets. That makes this data too small to put much faith in to evaluate labels generated from click streams.
  • We've built out a survey that runs on article pages and asks the user a yes/no question about relevance of a query to that article. We ran this survey for all q+d pairs in the discernatron data and then trained a NN to predict the labels assigned by human judges from the survey data. This looks to work reasonably well, but time constraints have prevented me from moving forward from there. The next step will be to run a survey with queries we don't have labels for and then use the NN to predict the relevance labels. I'm optimistic we can use the output of this NN to evaluate DBN vs NCM but it's hard to say at this point.
  • Previously we have evaluated changes to the DBN by running the full training pipeline to build models, and run AB tests with users with the different models. This is a fairly definitive method but it is also time consuming. We typically run AB tests for 14 days, analysis of the results may take another day or two on top of that. 15+ days to get feedback is much too slow to iterate with.
  2. With that being said, we need to see whether that really works or not.

Exactly.

  3. Earlier in the project discussion, “review the aggregation performed in the patch associated with this ticket, to verify that the input vectors are generated correctly according to the paper” was set as the starting point of the project.

Correct. There is a reasonable chance the aggregation i wrote generates correct sparse matrices, but code could always use review.

I am up for some of the todo tasks here. I'll write code according to this timeline; tasks with no timeline will be dealt with later:

Many of the TODO tasks on the data creation side of things are probably not too important for now. TODOs related to code clarity, optimizations, insufficiently abstracted code, etc. can be deferred until we prove with some level of certainty that NCM is comparable to or better than the DBN. If you want to provide patches for 1 or 2 of those items (such as you have already done for item 1) as part of the GSoC application process I think that could only be beneficial to your application as a demonstration of your familiarity with the concepts involved.

  1. support n-dimensions 17th March
  2. Use numpy arrays and resize? Would cut memory usage vs python list by 1/2. It might also allow faster copy operations from the vectors 18th March
  3. Output a single file with all using np.savez But then we have to dig (only a little) into the sparse matrix implementation
  4. Verify conclusions from research paper match our data. Although we don't care that much about click prediction. I suppose though we could try and compare results from a real AB test with that of a test that uses predicted clicks. See if they match up?
  5. Instead of these generators and excessive error handling, what if we copy the whole directory at the outset and use a contextmanager to clean it up?
  6. Evaluate with larger datasets line
  7. mask out the first element of time series, it is only the query and has nothing to predict line
  8. On my laptop (with only 2cpu) this is much slower, worth trying on a high core count machine though. workers=1, use_multiprocessing=True
  9. Mini batches/Memory Sized Batches More training data? Activation Functions Units reduce_input_dim
  4. That was not done, so it was assumed that the datasets were generated correctly, and the click data was made public with the attached Gerrit links.

Also correct. It might also be worth noting this is only a sample of the full dataset.

Before dealing with the click data generated, I would take up the task of reviewing the patch provided. After that, I'll provide an update here.

[1] https://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf

I had some doubts though:

  1. Regarding the implementation of the n-dimensional matrix: I am not sure about the use of pyspark in the existing code, maybe because I don't know much about it. Here's what I created[3] using numpy only.

The n-dimensional matrix isn't super important. Right now the code is of course messy and has two separate implementations (SparseVectorBuilder and Csr3dBuilder) for building a sparse 3d array on top of the 2d array provided by scipy. There isn't actually a use case (yet) for more than 3 dimensions; it just seemed the code might be more general, and more likely to be correct, if it was written for N dimensions.

I can certainly understand that the pyspark portion of the code is hard to follow. It's not clear from this code at all what the shape of the data read from the input_dir is, or what it represents. Without that information it's likely very difficult to evaluate if the aggregation is correct. I will try to find a way to at least document what the input data is to help make it clearer. In general spark is used here because the data is relatively large and spark allows applying significant amounts of parallelism to the task.

  2. Are the subtasks mentioned in the project for GSoC's whole timeline, or just for now, before the application period starts?

Subtasks are for the full timeline.

[2] https://zhangyuc.github.io/files/wang10cikm.pdf
[3] https://github.com/ansh103/Search_Engine_Click_Models/blob/master/sparse_n_dim.py

Hi @EBernhardson
I am Ansh, a 3rd-year undergraduate student from India. Over the past week, I found my interest in Neural Networks matching the project mentioned here. I have done a “Load Forecasting Project using RNNs for my Institute’s hourly consumption of units” as part of a previous project.
While going through the research article, I found that I need to understand the concepts of Information Retrieval at a deeper level, so I briefly went through this survey[1], which sums up all the click models in a more descriptive manner while staying comprehensible.

Good initiative. The author of the book is also the author of the DBN implementation we used, so it is quite likely to be relevant. I looked over the introduction and it's worth noting that the book refers to click prediction / user simulation as the primary goal of a click model. That is likely true in general, but in this particular task we are more concerned with the ability of the model to predict relevance. The relevance predictions are then used as the target document orderings for an ML ranking function.

After having read the research article thoroughly, I did see how relevance prediction (which is what we are concerned with in our project) was the second part of their conclusions, apart from the click prediction tasks.

It took quite some time to review all this material, but I came to the following conclusions:

  1. Wikimedia is interested in this research article because the “neural click model” proposed here is faster on both the click prediction and relevance prediction tasks. At the same time, it adds an additional feature, i.e. it can learn concepts which can’t be designed manually.

We aren't as worried about the speed of relevance prediction; the main goal is the accuracy of that prediction. The hypothesis is basically that neural click models can generate more accurate predictions of relevance from click data than the DBN. The paper shows, for their dataset, an improvement in ndcg@1 from 0.717 to 0.775 and in ndcg@3 from 0.725 to 0.773. If even half of that improvement is seen on our own data, it could lead directly to improvements in search results. One difficulty for this task will be that the researchers had manual relevance judgements for 5k queries to calculate those ndcg@k scores, but we don't have a similar quantity of "ground truth" data. We will need to determine some method of comparing the relevance predictions.

My bad here. Section 5.2 of the research article clearly shows how NCM performs better than other PGM-based click models on ndcg@1 and ndcg@3. They had 41,275 query-document pairs to back their relevance prediction results, but as you said we don't have that.
There are two ways to solve it (correct me if I am wrong):

  1. Like in the research paper, use human-generated binary relevance labels, such as those from Yandex's Relevance Prediction dataset. This is an expensive option; Russia's popular search engine had such resources, but here at Wikimedia we can't do that (assumption).
  2. Estimated relevance data[2]. The paper clearly shows how close the calculations are for estimated relevance data and editorial relevance data. NDCG scores are evaluated over the estimated, partial editorial, and combined datasets as well.

Indeed 1 is too expensive. 2 is interesting; I hadn't seen this particular paper before. After reading it I'm not sure how it solves the problem. My reading is that the primary contribution is a method to apply lambdamart directly to the probabilities of a PGM (GCM specifically, but others could be adapted). It seems more like another potential avenue to investigate for improving click modeling, to be compared against DBN and NCM.

I've been thinking about this for some time as well and have a few ideas, but not sure how many will pan out:

  • We built a crowd sourced labeling platform and set it up at https://discernatron.wmflabs.org. Query + document pairs are labeled both by people on our team and volunteers we could convince. Unfortunately the work is tedious and not directly rewarding. We labeled around 150 queries with an average of 40 documents each this way. A significant number of these queries are from the long tail though, and are unlikely to show up with any frequency in new clickstream datasets. That makes this data too small to put much faith in to evaluate labels generated from click streams.
  • We've built out a survey that runs on article pages and asks the user a yes/no question about relevance of a query to that article. We ran this survey for all q+d pairs in the discernatron data and then trained a NN to predict the labels assigned by human judges from the survey data. This looks to work reasonably well, but time constraints have prevented me from moving forward from there. The next step will be to run a survey with queries we don't have labels for and then use the NN to predict the relevance labels. I'm optimistic we can use the output of this NN to evaluate DBN vs NCM but it's hard to say at this point.
  • Previously we have evaluated changes to the DBN by running the full training pipeline to build models, and run AB tests with users with the different models. This is a fairly definitive method but it is also time consuming. We typically run AB tests for 14 days, analysis of the results may take another day or two on top of that. 15+ days to get feedback is much too slow to iterate with.
  1. The crowdsourced labeling platform would be tedious indeed, and we can't guarantee the authenticity of volunteers. I gave one of my batchmates the same form and he started assigning random labels to each result page of a particular query. The point being that anyone will get annoyed by this tedious procedure.
  2. In comparison to the previous one, surely this is a better situation given that the person only has to answer a yes or no question. Here's a doubt I have: if we are going to use this survey data to train NNs, then wouldn't having 4 options (irrelevant, maybe, probably, relevant) in the survey (like https://discernatron.wmflabs.org) be better than just yes or no? I may have misinterpreted you and the actual survey had 4 options. Basically the NNs have been trained but are yet to be tested; this can be done after we have completed the setup of the NCM in the GSoC timeline. It could be a task before the phase-1 evaluation, so that we have enough time left to try out option 3 you mentioned in case we fail (running the full training pipeline to build models and running AB tests with users with the different models, which takes roughly 15 days).
  2. With that being said, we need to see whether that really works or not.

Exactly.

  3. Earlier in the project discussion, “review the aggregation performed in the patch associated with this ticket, to verify that the input vectors are generated correctly according to the paper” was set as the starting point of the project.

Correct. There is a reasonable chance the aggregation i wrote generates correct sparse matrices, but code could always use review.

I am up for some of the todo tasks here. I'll write code according to this timeline; tasks with no timeline will be dealt with later:

Many of the TODO tasks on the data creation side of things are probably not too important for now. TODOs related to code clarity, optimizations, insufficiently abstracted code, etc. can be deferred until we prove with some level of certainty that NCM is comparable to or better than the DBN. If you want to provide patches for 1 or 2 of those items (such as you have already done for item 1) as part of the GSoC application process I think that could only be beneficial to your application as a demonstration of your familiarity with the concepts involved.

I'll try to solve some of them as a part of application with links to Github repo.
Since all this code cleanup has lower priority, I'll put other todo tasks on a later stage of GSoC timeline then?

  1. support n-dimensions 17th March
  2. Use numpy arrays and resize? Would cut memory usage vs python list by 1/2. It might also allow faster copy operations from the vectors 18th March
  3. Output a single file with all using np.savez But then we have to dig (only a little) into the sparse matrix implementation
  4. Verify conclusions from research paper match our data. Although we don't care that much about click prediction. I suppose though we could try and compare results from a real AB test with that of a test that uses predicted clicks. See if they match up?
  5. Instead of these generators and excessive error handling, what if we copy the whole directory at the outset and use a contextmanager to clean it up?
  6. Evaluate with larger datasets line
  7. mask out the first element of time series, it is only the query and has nothing to predict line
  8. On my laptop (with only 2cpu) this is much slower, worth trying on a high core count machine though. workers=1, use_multiprocessing=True
  9. Mini batches/Memory Sized Batches More training data? Activation Functions Units reduce_input_dim
  4. That was not done, so it was assumed that the datasets were generated correctly, and the click data was made public with the attached Gerrit links.

Also correct. It might also be worth noting this is only a sample of the full dataset.

Before dealing with the click data generated, I would take up the task of reviewing the patch provided. After that, I'll provide an update here.

[1] https://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf

I had some doubts though:

  1. Regarding the implementation of the n-dimensional matrix: I am not sure about the use of pyspark in the existing code, maybe because I don't know much about it. Here's what I created[3] using numpy only.

The n-dimensional matrix isn't super important. Right now the code is of course messy and has two separate implementations (SparseVectorBuilder and Csr3dBuilder) for building a sparse 3d array on top of the 2d array provided by scipy. There isn't actually a use case (yet) for more than 3 dimensions; it just seemed the code might be more general, and more likely to be correct, if it was written for N dimensions.

I can certainly understand that the pyspark portion of the code is hard to follow. It's not clear from this code at all what the shape of the data read from the input_dir is, or what it represents. Without that information it's likely very difficult to evaluate if the aggregation is correct. I will try to find a way to at least document what the input data is to help make it clearer. In general spark is used here because the data is relatively large and spark allows applying significant amounts of parallelism to the task.

I have started to grasp what is going on here and will soon be well acquainted with PySpark, which is essentially made for big data processing. Online resources[4] are helping a lot. I now see why Python would be a better way to go than Scala in our case (given its popularity with NNs, as you said in an earlier comment). I understood why findspark was imported and have started reading about RDDs and dataframes. Along with this guide and the information you'll provide later on, we can make progress.

  2. Are the subtasks mentioned in the project for GSoC's whole timeline, or just for now, before the application period starts?

Subtasks are for the full timeline.

[2] https://zhangyuc.github.io/files/wang10cikm.pdf
[3] https://github.com/ansh103/Search_Engine_Click_Models/blob/master/sparse_n_dim.py

Other things I had in mind regarding my proposal:
I was planning to attach a detailed analysis of our whole model (including all the conclusions we are drawing in this Phabricator thread) and the way we'll proceed in this project. I'll post a link to that soon so that you can comment on it.
Apart from this, I am planning on getting the proposal draft ready by 20th/21st March. It won't be too late, right?
[4] https://www.datacamp.com/community/tutorials/apache-spark-python

Here's the detailed analysis of the project. Proposal would be based on this.

@Kdhingra2210 Great! I'm not really sure what the best entry point is. Without review of the data collection we mostly have to hope the aggregation I wrote is correct. Even so, I put up a sample of the available click data on the public archive[1]. The data is perhaps deceptively small as it compresses very well. Contained there are around 900k observations (search sessions). The X matrix has an overall shape of (899310, 11, 11265), giving ~111 billion individual values. I wrote a README file that hopefully explains the data format. The data format can also be reviewed in the gerrit link above, which contains the code used to collect these sparse matrices.

[1] https://analytics.wikimedia.org/datasets/discovery/ncm-agg-clicks/

@EBernhardson thanks for the reply. I have looked at Dynamic Bayesian Networks as well as the NCM; apart from the performance, I guess the network would be more simplified and central.

Also, I checked the patch you pushed to Gerrit and there are some changes I had made. Can I push them directly to gerrit, or should I share the github with you?

Also, I read about the need for conversion between various formats to support Keras. Are we limited to keras only?

@EBernhardson Can you share what results you got with that algorithm? I was getting a fairly good confusion matrix.

Also, is the shape of the npz matrix files the same as the SQL data?

you can find the test file with 5 epoch over different learning rates here:

https://nbviewer.jupyter.org/urls/s3.amazonaws.com/tripshire.staticfiles/ncm.ipynb

I will update with extensive testing in few hours

  1. The crowdsourced labeling platform would be tedious indeed, and we can't guarantee the authenticity of volunteers. I gave one of my batchmates the same form and he started assigning random labels to each result page of a particular query. The point being that anyone will get annoyed by this tedious procedure.

For better or worse a significant portion of the ratings were done by people on our team, which provided an additional source of verification. There is also a reasonable body of literature out there for detecting bad-faith raters (and particularly hard tasks); we used multiple raters per task and Krippendorff's alpha. People getting paid to do tasks on Mechanical Turk cheat or click randomly, so there are many techniques for dealing with that. We had hoped that volunteers would have less incentive to cheat, but with enough data we thought we could sort it out. The unsolvable problem was the tedium; it's just so boring and annoying to do that we couldn't get enough data.

  2. In comparison to the previous one, surely this is a better situation given that the person only has to answer a yes or no question. Here's a doubt I have: if we are going to use this survey data to train NNs, then wouldn't having 4 options (irrelevant, maybe, probably, relevant) in the survey (like https://discernatron.wmflabs.org) be better than just yes or no? I may have misinterpreted you and the actual survey had 4 options.

We considered many options in terms of the wording of the question and the number of options to offer. We settled on three answers—yes, no, and something that counts as opting out, like "I don't know" (see T176428). We decided to keep it simple because the higher response volume means we can reasonably assign different levels of relevance based on the proportion of yes and no votes (and other factors in the NN). Anyone interested can see T171740: [Epic] Search Relevance: graded by humans for a lot more detail.
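As a purely hypothetical illustration of the "proportion of yes votes" idea (the real pipeline trains a NN on the survey responses rather than using raw proportions, and the function name here is made up):

def graded_label_from_votes(yes, no, scale=10):
    # Map the share of 'yes' answers for a (query, page) pair onto the same
    # 0..10 integer label range the MLR pipeline already consumes.
    total = yes + no
    if total == 0:
        return None    # no signal, skip this pair
    return int(round(scale * yes / total))

print(graded_label_from_votes(37, 5))    # 9
print(graded_label_from_votes(3, 12))    # 2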

@Kdhingra2210 Great! I'm not really sure what the best entry point is. Without review of the data collection we mostly have to hope the aggregation I wrote is correct. Even so, I put up a sample of the available click data on the public archive[1]. The data is perhaps deceptively small as it compresses very well. Contained there are around 900k observations (search sessions). The X matrix has an overall shape of (899310, 11, 11265), giving ~111 billion individual values. I wrote a README file that hopefully explains the data format. The data format can also be reviewed in the gerrit link above, which contains the code used to collect these sparse matrices.

[1] https://analytics.wikimedia.org/datasets/discovery/ncm-agg-clicks/

@EBernhardson thanks for the reply. I have looked at Dynamic Bayesian Networks as well as the NCM; apart from the performance, I guess the network would be more simplified and central.

Also, I checked the patch you pushed to Gerrit and there are some changes I had made. Can I push them directly to gerrit, or should I share the github with you?

If you remove the 'Change-Id' line from the patch and submit it to gerrit it will get assigned a new change-id and become a 'new' patch. Alternatively github is perfectly fine for the purposes of evaluating applications.

Also, I read about the need for conversion between various formats to support Keras. Are we limited to keras only?

The conversion is less about keras, and more about the inability of scipy.sparse in python to represent 3 dimensional matrices. The data needs to be stored in a sparse format to keep the size of the data reasonable, but scipy.sparse only supports 2 dimensional matrices. To feed this data into a NN lib (keras, theano, pytorch, etc.) it needs to be converted into the 3d format required by the model. So the conversion is mostly just getting the data from the efficient storage format into something keras can use.
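A minimal sketch of that conversion as a Keras Sequence, under the same hypothetical (n_sessions * 11, 11265) CSR layout assumed earlier; the real loader in the patch may differ:

import numpy as np
from keras.utils import Sequence

class SparseClickSequence(Sequence):
    # Densify the sparse 2d storage into (batch, 11, 11265) tensors one batch at a time.

    def __init__(self, X2d, y, batch_size=32, steps=11, dim=11265):
        self.X2d, self.y = X2d, y
        self.batch_size, self.steps, self.dim = batch_size, steps, dim

    def __len__(self):
        return (self.X2d.shape[0] // self.steps) // self.batch_size

    def __getitem__(self, idx):
        lo = idx * self.batch_size * self.steps
        hi = lo + self.batch_size * self.steps
        X = np.asarray(self.X2d[lo:hi].todense())
        X = X.reshape(self.batch_size, self.steps, self.dim)
        # The target layout (per-session vs per-timestep clicks) depends on the model.
        return X, self.y[idx * self.batch_size:(idx + 1) * self.batch_size]

# model.fit_generator(SparseClickSequence(X2d, y), epochs=5)   # Keras 2.x era API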

In terms of keras specifically, it is not a requirement. I did the first draft with Keras because it was pretty straight forward. Anything in python, java or scala is acceptable. Other JVM languages could be considered, but generally I would like to stick to languages we already use.

@EBernhardson Can you share what results you got with that algorithm? I was getting a fairly good confusion matrix.

The confusion matrices in your link look relatively similar to what I was getting. The confusion matrices are good for a quick look at how it's doing, but i think they sometimes make the results look better than they are. Precision, recall and F1 scores might be a reasonable way to look at the data. One of the strongest models in your link looks to be (roughly):

val_loss: 0.1629 - val_acc: 0.9404
[[255628   6692]
 [  6338  20592]]

This says in the ground truth there were 20592 + 6692 = 27284 clicks. The model predicted 26930 clicks.
This gives a few metrics:

  • precision: 20592 / 26930 = 0.765
  • recall: 20592 / 27284 = 0.755
  • We could calculate F[0.5, 1, 2] from that, but the numbers are pretty close so call it 0.76.

This is certainly not a bad baseline, but seems like there is plenty of room for improvement.
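Reproducing that arithmetic (using the same reading of the matrix as above, i.e. 27284 actual clicks and 26930 predicted clicks):

tp = 20592
actual_clicks = 20592 + 6692       # 27284
predicted_clicks = 20592 + 6338    # 26930

precision = tp / predicted_clicks                     # ~0.765
recall = tp / actual_clicks                           # ~0.755
f1 = 2 * precision * recall / (precision + recall)    # ~0.76
print(round(precision, 3), round(recall, 3), round(f1, 3))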

Also, is the shape of the npz matrix files the same as the SQL data?

If by SQL data, you mean the input to the data collection, that data is structured with one row per page displayed to a user. The schema is roughly:

<wikiid: string, norm_query_id: int, session_id: string, hit_page_id: int, hit_position: int, clicked: boolean>
  • wikiid references a single site. For this initial evaluation data is limited to english wikipedia, but it would be interesting to pull data for other sites and evaluate
  • norm_query_id is an identifier that replaced text query strings, norm refers to a normalization process that groups together very similar textual queries under the same id.
  • session_id is also an identifier. All rows with the same session_id were performed by the same user within some timespan (i think timeout is 10 minutes between events).
  • hit_page_id is the page_id from mediawiki for the page displayed
  • hit_position is the position the result was displayed in
  • clicked is a boolean indicating if the user clicked this hit

The npz matrix files are the aggregation of the above data as described in the paper. I hope that answers your question?
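For anyone exploring the raw side of this, a schematic pyspark aggregation over the schema above (the input path is hypothetical, and the real job in the attached patch additionally builds the sparse NCM feature vectors, which is omitted here):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
clicks = spark.read.parquet('hdfs:///path/to/click/data')   # hypothetical path

sessions = (
    clicks
    .where(F.col('wikiid') == 'enwiki')
    .groupBy('session_id', 'norm_query_id')
    .agg(
        F.sort_array(F.collect_list(F.struct('hit_position', 'hit_page_id', 'clicked')))
         .alias('results'),
        F.sum(F.col('clicked').cast('int')).alias('num_clicks')))

sessions.show(5, truncate=False)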

you can find the test file with 5 epoch over different learning rates here:

https://nbviewer.jupyter.org/urls/s3.amazonaws.com/tripshire.staticfiles/ncm.ipynb

I will update with extensive testing in few hours

  1. The crowdsourced labeling platform would be tedious indeed, and we can't guarantee the authenticity of volunteers. I gave one of my batchmates the same form and he started assigning random labels to each result page of a particular query. The point being that anyone will get annoyed by this tedious procedure.

For better or worse a significant portion of the ratings were done by people on our team, which provided an additional source of verification. There is also a reasonable body of literature out there for detecting bad-faith raters (and particularly hard tasks); we used multiple raters per task and Krippendorff's alpha. People getting paid to do tasks on Mechanical Turk cheat or click randomly, so there are many techniques for dealing with that. We had hoped that volunteers would have less incentive to cheat, but with enough data we thought we could sort it out. The unsolvable problem was the tedium; it's just so boring and annoying to do that we couldn't get enough data.

Thanks for clearing that out. Tedium is the only reason then.

  2. In comparison to the previous one, surely this is a better situation given that the person only has to answer a yes or no question. Here's a doubt I have: if we are going to use this survey data to train NNs, then wouldn't having 4 options (irrelevant, maybe, probably, relevant) in the survey (like https://discernatron.wmflabs.org) be better than just yes or no? I may have misinterpreted you and the actual survey had 4 options.

We considered many options in terms of the wording of the question and the number of options to offer. We settled on three answers—yes, no, and something that counts as opting out, like "I don't know" (see T176428). We decided to keep it simple because the higher response volume means we can reasonably assign different levels of relevance based on the proportion of yes and no votes (and other factors in the NN). Anyone interested can see T171740: [Epic] Search Relevance: graded by humans for a lot more detail.

Does make sense given the amount of volume. Will update once I go through the link.

@EBernhardson Can you share what results you got with that algorithm? I was getting a fairly good confusion matrix.

The confusion matrices in your link look relatively similar to what I was getting. The confusion matrices are good for a quick look at how it's doing, but i think they sometimes make the results look better than they are. Precision, recall and F1 scores might be a reasonable way to look at the data. One of the strongest models in your link looks to be (roughly):

val_loss: 0.1629 - val_acc: 0.9404
[[255628   6692]
 [  6338  20592]]

This says in the ground truth there were 20592 + 6692 = 27284 clicks. The model predicted 26930 clicks.
This gives a few metrics:

  • precision: 20592 / 26930 = 0.765
  • recall: 20592 / 27284 = 0.755
  • We could calculate F[0.5, 1, 2] from that, but the numbers are pretty close so call it 0.76.

This is certainly not a bad baseline, but seems like there is plenty of room for improvement.

I thought it was the opposite; that's why I asked about your confusion matrix, because if you look, the F1 score for the non-click class is very high (around 97%).

Also, is the shape of the npz matrix files the same as the SQL data?

If by SQL data, you mean the input to the data collection, that data is structured with one row per page displayed to a user. The schema is roughly:

<wikiid: string, norm_query_id: int, session_id: string, hit_page_id: int, hit_position: int, clicked: boolean>
  • wikiid references a single site. For this initial evaluation data is limited to english wikipedia, but it would be interesting to pull data for other sites and evaluate
  • norm_query_id is an identifier that replaced text query strings, norm refers to a normalization process that groups together very similar textual queries under the same id.
  • session_id is also an identifier. All rows with the same session_id were performed by the same user within some timespan (i think timeout is 10 minutes between events).
  • hit_page_id is the page_id from mediawiki for the page displayed
  • hit_position is the position the result was displayed in
  • clicked is a boolean indicating if the user clicked this hit

Gotcha, thanks. (I was in doubt about the mapping earlier because of the high negative-class F1 score; now it's clear.)

The npz matrix files are the aggregation of the above data as described in the paper. I hope that answers your question?

you can find the test file with 5 epoch over different learning rates here:

https://nbviewer.jupyter.org/urls/s3.amazonaws.com/tripshire.staticfiles/ncm.ipynb

I will update with extensive testing in few hours

Yeah, basically I trained on the data with few epochs to understand the relations between hyperparameters. Also, I have found some relation between different learning rates, input_dim, and the F1 score, but I am training it on 3 npz files (to confirm).

Also, applying the model to datasets of different sizes without retraining can be done using dynamic LSTM cells (preferably GRU); I have worked on dynamic time series analysis and they provide quite good outputs but need relatively more training.

@EBernhardson I ported this model to barebones TensorFlow; though it doesn't support gridsearch or improved TimeDistributed dense functions, it does support dynamic indexing over the time dimension.

But in terms of the core understanding of the whole data (while transforming), this model (NCM) is single-instance based (I mean the earlier relations (t-1) should be represented using the QD+Q+D feature vector only).

  • Evaluate how we would apply this to new data without re-training. As the feature vectors are aggregated values across the dataset it would, potentially, work poorly on a new dataset of a different size than the one trained on. Likely some form of normalization can improve the situation.

Are you referring to a change in SERP_SIZE here? If so, I would need to know which Q and D field types you used and how you created the feature vector, before applying the normalization.

Also, do you have any specific reason for including the Q and D features alongside QD, apart from accuracy?

Also, if SERP_SIZE is going to be fixed and we are talking about dynamicity in the time dimension for each query (like Q0 -> Qn), then check out the updated Jupyter notebook here (it's done but disabled, because no such relationships were found in the data).
(https://nbviewer.jupyter.org/urls/s3.amazonaws.com/tripshire.staticfiles/ncm.ipynb)

Also, I have moved the model into this file (https://s3.amazonaws.com/tripshire.staticfiles/model.py). Can you check the data relations? Also, instead of using a dense layer to merge down the LSTM outputs, I used the output of the last timestep (it looks more logical and the output is about the same).

Also, applying the model to datasets of different sizes without retraining can be done using dynamic LSTM cells (preferably GRU). I have worked with these for dynamic time-series analysis and they give noticeably better outputs, but they need relatively more training.

Considering that time will be an issue to some extent, we would need to analyze which one is better for this dataset. Here [1] it was concluded that both gating units perform better than a traditional RNN, which is expected, but it also found that LSTM performed better on one dataset whereas GRU performed better on another. @EBernhardson could direct us from here.

[1] https://arxiv.org/pdf/1412.3555v1.pdf

@EBernhardson I ported this model to bare-bones TensorFlow. It doesn't support grid search or improve the time-distributed dense functions, but it does support dynamic indexing over the time dimension.

TensorFlow seems an interesting choice. I liked the usage of GRU in the model. I have some extra things to add if we do consider TensorFlow as an option. Regarding hyperparameter tuning in TensorFlow:
The main reason for using grid search (human-driven) instead of hyperopt or something similar was the time it would take, considering we may have ~500 models with 100x more data than now. So how do we do that in TensorFlow? tf.contrib.learn has classifiers that are compatible with scikit-learn; we need to explicitly set the environment variable TENSORFLOW_SKLEARN to 1 to get the required compatibility. Then, provided it works, we can easily use GridSearchCV. There are other options too, which can be explored later if this fails.

  • Evaluate how we would apply this to new data without re-training. As the feature vectors are aggregated values across the dataset it would, potentially, work poorly on a new dataset of a different size than the one trained on. Likely some form of normalization can improve the situation.

Are you referring to a change in SERP_SIZE here? If so, I would need to know which Q and D field types you used and how you created the feature vector, before applying the normalization.

Not sure if this is the right file, but here it is. Also, what are your thoughts on the normalization procedure? I will post a doc on it soon.

Also, applying the model to datasets of different sizes without retraining can be done using dynamic LSTM cells (preferably GRU). I have worked with these for dynamic time-series analysis and they give noticeably better outputs, but they need relatively more training.

Considering that time will be an issue to some extent, we would need to analyze which one is better for this dataset. Here [1] it was concluded that both gating units perform better than a traditional RNN, which is expected, but it also found that LSTM performed better on one dataset whereas GRU performed better on another. @EBernhardson could direct us from here.

[1] https://arxiv.org/pdf/1412.3555v1.pdf

Choosing between GRU and LSTM is not something to worry about; there are bigger things to think about at the moment.

@EBernhardson I ported this model to bare-bones TensorFlow. It doesn't support grid search or improve the time-distributed dense functions, but it does support dynamic indexing over the time dimension.

TensorFlow seems an interesting choice. I liked the usage of GRU in the model. I have some extra things to add if we do consider TensorFlow as an option. Regarding hyperparameter tuning in TensorFlow:
The main reason for using grid search (human-driven) instead of hyperopt or something similar was the time it would take, considering we may have ~500 models with 100x more data than now. So how do we do that in TensorFlow? tf.contrib.learn has classifiers that are compatible with scikit-learn; we need to explicitly set the environment variable TENSORFLOW_SKLEARN to 1 to get the required compatibility. Then, provided it works, we can easily use GridSearchCV. There are other options too, which can be explored later if this fails.

I didn't integrate grid search because I wrote this notebook to understand the algorithm and the approach in depth. I used TensorFlow because I find it more flexible for tweaking submodules than other tools (personal preference).

Also, personally, I would say grid search should be a lower priority than integrating the dynamicity (because that can completely change the hyperparameters). I will work first on integrating dynamicity over the required axis.

  • Evaluate how we would apply this to new data without re-training. As the feature vectors are aggregated values across the dataset it would, potentially, work poorly on a new dataset of a different size than the one trained on. Likely some form of normalization can improve the situation.

Are you referring to a change in SERP_SIZE here? If so, I would need to know which Q and D field types you used and how you created the feature vector, before applying the normalization.

Not sure if this is the right file, but here it is. Also, what are your thoughts on the normalization procedure? I will post a doc on it soon.

SERP_SIZE and normalization depend on whether we have to handle a dynamic SERP_SIZE (and I guess SERP_SIZE doesn't change that often), so I am pretty sure this is not what @EBernhardson meant in his comment.

@EBernhardson can you explain a bit further what you meant about datasets of different sizes?

@EBernhardson for 3D sparse matrices, I looked through multiple posts and found that TensorFlow has sparse input support for N dimensions, so we can pipeline the input into TensorFlow's sparse format directly.

You can see TensorFlow's SparseTensor here [https://www.tensorflow.org/api_docs/python/tf/SparseTensor].
I would guess it is more optimized than a to-dense wrapper written in pure Python (or one using scipy and a transformation).

I am going to write a wrapper in the attached notebook that converts the current CSR data to TensorFlow's format, to compare the computational cost of dense tensors versus SparseTensor.

For the conversion (from dense to sparse), I might try the NdSparse matrix from xpspectre [https://github.com/xpspectre/Ndsparse].

If these prove to be good, then we can write our own wrapper for this specific purpose (there is also the possibility of dropping the wrapper and converting directly to TensorFlow sparse tensors, but that would require support from @TJones or @EBernhardson to reprocess the data).

The associated patch implements the model using Keras. We can continue using the same library for the sake of simplicity and work around its limitations, or we can choose a different library altogether.
Let's analyze each of them:
TensorFlow: The widely popular library from Google has a lot of potential and is certainly a good choice wherever you need more under-the-hood operations and want full control over your model. To be honest, it's unfair to compare the two: Keras is a higher-level abstraction for deep learning whereas TensorFlow is lower level. Keras was used initially in the patch for that reason (easier implementation, and some of its models can be quite powerful).
PyTorch: Right now TensorFlow has limited support for dynamic inputs in comparison to PyTorch. I personally feel the intuitiveness of PyTorch makes it stand out. PyTorch is also a lower-level abstraction, but it feels more like a framework to use.
What should we use?
TensorFlow's advantages are visualization through TensorBoard, large-scale distributed training (though PyTorch has started supporting that now), and deployment of models to production.
This GSoC project is mostly about testing the potential of the newer NCM model rather than actually deploying a model at full scale (which is where TensorFlow has an edge). Simply put, PyTorch is made for research, and even for production if the functional requirements are modest. It provides a better debugging and development experience, with tools like pdb, ipdb, etc. And last but not least, it is Python at its core.
This blog really puts my points in a better perspective.

Anyway, that's my way of seeing things, but there are many other factors still to consider. It's a choice that will need @EBernhardson's opinion.

Useful links:-

  1. PyTorch Sparse- http://pytorch.org/docs/master/sparse.html
  2. On-the-fly Operation Batching- https://arxiv.org/pdf/1705.07860.pdf

@EBernhardson Can you share what results you got with that algorithm? I was getting a fairly good confusion matrix.

The confusion matrices in your link look relatively similar to what I was getting. The confusion matrices are good for a quick look at how it's doing, but I think they sometimes make the results look better than they are. Precision, recall, and F1 scores might be a more reasonable way to look at the data. One of the strongest models in your link looks to be (roughly):

val_loss: 0.1629 - val_acc: 0.9404
[[255628   6692]
 [  6338  20592]]

This says that in the ground truth there were 20592 + 6692 = 27284 clicks, while the model predicted 6338 + 20592 = 26930 clicks.
This gives a few metrics:

  • precision: 20592 / 26930 ≈ 0.765
  • recall: 20592 / 27284 ≈ 0.755
  • We could calculate F[0.5, 1, 2] from those, but the numbers are close enough to each other that we can call it 0.76.

This is certainly not a bad baseline, but it seems like there is plenty of room for improvement.

I thought it was the opposite; that's why I asked about your confusion matrix,

because if you look at it, the F1 score for the 'not clicked' class is very high (around 97%).

Both are going to be important, but in general our data is very unbalanced, with an order of magnitude more 'not clicked' observations. A model that predicted everything as not clicked would have a recall of 1 and a precision of 0.91 for the not-clicked class. Having such high scores for a poorly fit model suggests to me that evaluating the ability to predict not-clicked isn't going to be a strong signal.
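(Spelling out that degenerate baseline with the counts from the matrix above, scored on the 'not clicked' class:)

```python
# Ground-truth totals from the confusion matrix above.
not_clicked = 255628.0 + 6338.0   # 261966 observations that were not clicked
clicked = 6692.0 + 20592.0        # 27284 observations that were clicked
total = not_clicked + clicked

# A model that always predicts "not clicked", scored on the not-clicked class:
recall = not_clicked / not_clicked                    # 1.0
precision = not_clicked / total                       # ≈ 0.906
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.95

print(precision, recall, f1)
```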

Also, is the shape of the npz matrix files the same as that of the SQL data?

If by SQL data you mean the input to the data collection: that data is structured with one row per page displayed to a user. The schema is roughly:

<wikiid: string, norm_query_id: int, session_id: string, hit_page_id: int, hit_position: int, clicked: boolean>
  • wikiid references a single site. For this initial evaluation the data is limited to English Wikipedia, but it would be interesting to pull data for other sites and evaluate those as well
  • norm_query_id is an identifier that replaces the text query string; norm refers to a normalization process that groups very similar textual queries under the same id
  • session_id is also an identifier. All rows with the same session_id were performed by the same user within some timespan (I think the timeout is 10 minutes between events)
  • hit_page_id is the page_id from mediawiki for the page displayed
  • hit_position is the position the result was displayed in
  • clicked is a boolean indicating whether the user clicked this hit

Gotcha, thanks (I had doubts about the mapping earlier because of the high negative-class F1 score; now it's clear).

The npz matrix files are the aggregation of the above data as described in the paper. I hope that answers your question?

You can find the test file with 5 epochs over different learning rates here:

https://nbviewer.jupyter.org/urls/s3.amazonaws.com/tripshire.staticfiles/ncm.ipynb

I will update it with more extensive testing in a few hours.

Yeah, basically I trained with a low number of epochs to understand the relations between the hyperparameters. I have also found some relation between learning rate, input_dim, and F1 score, but I am re-running it on 3 npz files to make sure.

Also, applying the model to datasets of different sizes without retraining can be done using dynamic LSTM cells (preferably GRU). I have worked with these for dynamic time-series analysis and they give noticeably better outputs, but they need relatively more training.

This could be interesting to look at.

@EBernhardson I ported this model to bare-bones TensorFlow. It doesn't support grid search or improve the time-distributed dense functions, but it does support dynamic indexing over the time dimension.

But from my understanding of the data as a whole (while transforming it), this model (NCM) is single-instance based (I mean that the earlier relations at t-1 should be represented using the QD+Q+D feature vector only).

  • Evaluate how we would apply this to new data without re-training. As the feature vectors are aggregated values across the dataset it would, potentially, work poorly on a new dataset of a different size than the one trained on. Likely some form of normalization can improve the situation.

Are you referring to a change in SERP_SIZE here? If so, I would need to know which Q and D field types you used and how you created the feature vector, before applying the normalization.

Due to how much SERP_SIZE changes the dimensions, I'm thinking that we need to stay pretty much at 10. For new data I was referring to the amount of time we aggregate the click data over. Within Wikimedia we only keep the input data (click logs, queries, etc.) necessary for the machine learned ranking for 90 days. What I was thinking about here is that if we train a model against 90 days of aggregated data, save it somewhere, and then later want to run new data against it, does that new data need to cover the same number of days? Many of the individual values of the feature vector are counts of clicks over the aggregation period, so the scale of the data might make a difference. I'm not sure this is the most important thing to figure out though; setting the aggregation period and keeping it constant is relatively easy.
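(One simple option, purely as an assumption and not something in the patch: scale the count-valued features by the length of the aggregation window so datasets collected over different periods land on a common scale.)

```python
import numpy as np

def normalize_counts(features, n_days, reference_days=90):
    """Rescale aggregated click-count features to a common reference window.

    Illustrative only: assumes `features` holds raw counts aggregated over
    `n_days`. The real feature vectors mix counts with other values, which
    would need to be handled separately.
    """
    return features * (reference_days / float(n_days))

# Example: counts from a 30-day aggregation scaled up to the 90-day reference.
counts_30d = np.array([3.0, 12.0, 0.0, 1.0])
print(normalize_counts(counts_30d, n_days=30))   # [ 9. 36.  0.  3.]
```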

Also, do you have any specific reason for including the Q and D features alongside QD, apart from accuracy?

Hmm, that may be a mistake in my aggregation. This code should be able to generate either Set 2 (QD+Q) or Set 3 (QD+Q+D), depending on the prediction_type parameter passed to the build_spark_dataset function. My reading is that both have Q, D, and I vectors; the difference is that in Set 3 the D vector is aggregated over all queries the document appears in, while in Set 2 the D vector is aggregated over only the query that matches the observation.
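(A toy illustration of that difference, with hypothetical column names rather than the actual mjolnir aggregation:)

```python
import pandas as pd

# Toy click log: one row per (query, document) impression.
log = pd.DataFrame({
    "norm_query_id": [1, 1, 2, 2, 2],
    "hit_page_id":   [10, 11, 10, 12, 10],
    "clicked":       [1, 0, 0, 1, 1],
})

# Set 2 style: document statistics aggregated only within the matching query.
d_per_query = log.groupby(["norm_query_id", "hit_page_id"])["clicked"].sum()

# Set 3 style: document statistics aggregated over every query it appears in.
d_all_queries = log.groupby("hit_page_id")["clicked"].sum()

print(d_per_query)
print(d_all_queries)
```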

Also, if SERP_SIZE is going to be fixed and we are talking about dynamicity in the time dimension for each query (like Q0 -> Qn), then check out the updated Jupyter notebook here (it's done but disabled, because no such relationships were found in the data).
(https://nbviewer.jupyter.org/urls/s3.amazonaws.com/tripshire.staticfiles/ncm.ipynb)

I think SERP_SIZE changing would make things much too difficult, as that would also change the size of the individual observations.

Also, I have moved the model into this file (https://s3.amazonaws.com/tripshire.staticfiles/model.py). Can you check the data relations? Also, instead of using a dense layer to merge down the LSTM outputs, I used the output of the last timestep (it looks more logical and the output is about the same).

This looks reasonable to me.

@EBernhardson for 3D sparse matrices, I looked through multiple posts and found that TensorFlow has sparse input support for N dimensions, so we can pipeline the input into TensorFlow's sparse format directly.

You can see TensorFlow's SparseTensor here [https://www.tensorflow.org/api_docs/python/tf/SparseTensor].
I would guess it is more optimized than a to-dense wrapper written in pure Python (or one using scipy and a transformation).

That looks excellent, and it seems worthwhile to utilize the direct support where available. That also hopefully means we don't have to re-implement per-epoch shuffling like I tried to do in the provided loader.

I am going to write a wrapper in the attached notebook that converts the current CSR data to TensorFlow's format, to compare the computational cost of dense tensors versus SparseTensor.

For the conversion (from dense to sparse), I might try the NdSparse matrix from xpspectre [https://github.com/xpspectre/Ndsparse].

If these prove to be good, then we can write our own wrapper for this specific purpose (there is also the possibility of dropping the wrapper and converting directly to TensorFlow sparse tensors, but that would require support from @TJones or @EBernhardson to reprocess the data).

I can certainly run a new aggregation of the data if we find a better way. Some script to convert scipy.sparse to the TensorFlow format should be short enough to write so we can evaluate that beforehand.
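(A minimal sketch of that conversion script; tf.SparseTensor is the real API, but the assumption that the npz files hold 2-D scipy.sparse slices is mine.)

```python
import numpy as np
import scipy.sparse as sp
import tensorflow as tf

def csr_to_sparse_tensor(mat):
    """Convert a 2-D scipy.sparse matrix into a tf.SparseTensor."""
    coo = mat.tocoo()
    indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
    return tf.SparseTensor(indices=indices,
                           values=coo.data.astype(np.float32),
                           dense_shape=coo.shape)

# Small random CSR matrix standing in for one slice of the aggregated data.
m = sp.random(4, 6, density=0.3, format="csr", dtype=np.float32)
st = csr_to_sparse_tensor(m)
dense = tf.sparse_tensor_to_dense(st)  # tf.sparse.to_dense in newer releases
```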

The associated patch implements the model using Keras. We can continue using the same library for the sake of simplicity and work around its limitations, or we can choose a different library altogether.
Let's analyze each of them:
TensorFlow: The widely popular library from Google has a lot of potential and is certainly a good choice wherever you need more under-the-hood operations and want full control over your model. To be honest, it's unfair to compare the two: Keras is a higher-level abstraction for deep learning whereas TensorFlow is lower level. Keras was used initially in the patch for that reason (easier implementation, and some of its models can be quite powerful).
PyTorch: Right now TensorFlow has limited support for dynamic inputs in comparison to PyTorch. I personally feel the intuitiveness of PyTorch makes it stand out. PyTorch is also a lower-level abstraction, but it feels more like a framework to use.
What should we use?
TensorFlow's advantages are visualization through TensorBoard, large-scale distributed training (though PyTorch has started supporting that now), and deployment of models to production.
This GSoC project is mostly about testing the potential of the newer NCM model rather than actually deploying a model at full scale (which is where TensorFlow has an edge). Simply put, PyTorch is made for research, and even for production if the functional requirements are modest. It provides a better debugging and development experience, with tools like pdb, ipdb, etc. And last but not least, it is Python at its core.
This blog really puts my points in a better perspective.

Anyway, that's my way of seeing things, but there are many other factors still to consider. It's a choice that will need @EBernhardson's opinion.

I'm open to the various python NN libraries. I'm a bit partial to the TF/Keras side of things, but working with one of the other libraries is acceptable as well.

Useful links:-

  1. PyTorch Sparse- http://pytorch.org/docs/master/sparse.html
  2. On-the-fly Operation Batching- https://arxiv.org/pdf/1705.07860.pdf

Due to how much SERP_SIZE changes the dimensions, I'm thinking that we need to stay pretty much at 10. For new data I was referring to the amount of time we aggregate the click data over. Within Wikimedia we only keep the input data (click logs, queries, etc.) necessary for the machine learned ranking for 90 days. What I was thinking about here is that if we train a model against 90 days of aggregated data, save it somewhere, and then later want to run new data against it, does that new data need to cover the same number of days? Many of the individual values of the feature vector are counts of clicks over the aggregation period, so the scale of the data might make a difference. I'm not sure this is the most important thing to figure out though; setting the aggregation period and keeping it constant is relatively easy.

Currently the model supports variable-length input (in terms of days). Also, we can save the trained model to act as a backup in case we ever have to restart the whole model.

OK, I understand your point about normalizing the scale of the click counts; we can look into this (though it needs some discussion).

I think SERP_SIZE changing would make things much too difficult, as that would also change the size of the individual observations.

Yeah, things will get tricky, but it might be possible; it would require designing some new architecture for the Neural Click Model. I will try this on some openly available data, because it sounds interesting in its own right.

I can certainly run a new aggregation of the data if we find a better way. Some script to convert scipy.sparse to the TensorFlow format should be short enough to write so we can evaluate that beforehand.

I will design a proper wrapper to convert them initially, and if it proves promising then you can integrate it so the model supports it out of the box.

Also, can you tell me whether the click logs from the first assessment of learning-to-rank are still available? If so, we can test the majority of the A/B tests without redoing the whole experiment. (Page visit times cannot be computed, and I haven't been able to analyse the interleaved results properly; those need some insights from you.)

I have tried plenty of approaches for maintaining 3D sparse matrices, but none have worked out well so far.

TensorFlow's SparseTensor doesn't yet support the operations we need; I could have used it for basic 3D sparse storage, but it doesn't support slicing yet either.

There is an existing library with 3D COO matrix support, based on Cython only (I think), but its slicing is not fast enough to beat scipy's.

Results showed it to be quite slow compared to the current methods.

Also, TensorFlow doesn't support the sparse operations we need, so we would have to convert to dense anyway.

I am thinking of operating directly over the indices and data, building a dense batch array directly from the data.

Also, @EBernhardson I have set up a normalization function for input dynamicity over the length of the time period. As you know, the current data covers a 90-day period; can you regenerate data over different time periods so that I can test its performance and make the required changes?

Sorry to take so long to reply, I've been traveling for conferences. Back now.

I can certainly run a new aggregation of the data if we find a better way. Some script to convert scipy.sparse to the TensorFlow format should be short enough to write so we can evaluate that beforehand.

I will design a proper wrapper to convert them initially, and if it proves promising then you can integrate it so the model supports it out of the box.

Also, can you tell me whether the click logs from the first assessment of learning-to-rank are still available? If so, we can test the majority of the A/B tests without redoing the whole experiment. (Page visit times cannot be computed, and I haven't been able to analyse the interleaved results properly; those need some insights from you.)

Click logs from that assessment will be gone, but we run new tests regularly that we could test against. We also have ongoing collection (although at a relatively low sampling rate) which feeds into our dashboards and can also be used. Page visit times we can probably ignore; there would be no good way to estimate them. The interleaved results we could attempt to simulate, but it's probably not necessary. The interleaving we do essentially blends the results of two ranking functions (in a way such that a user who clicks randomly produces no preference between the two), and then records which ranking function receives the most clicks.
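(For anyone following along, a sketch of one common scheme, team-draft interleaving, which matches the description above; the comment does not say this is the exact method used.)

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Blend two rankings so clicks can later be credited to either ranker."""
    interleaved, teams, seen = [], [], set()
    while len(interleaved) < k:
        order = [("A", ranking_a), ("B", ranking_b)]
        random.shuffle(order)          # randomize who picks first each round
        progressed = False
        for team, ranking in order:
            pick = next((d for d in ranking if d not in seen), None)
            if pick is not None and len(interleaved) < k:
                interleaved.append(pick)
                teams.append(team)
                seen.add(pick)
                progressed = True
        if not progressed:             # both rankings exhausted
            break
    return interleaved, teams

# Whichever team's results collect more clicks wins the comparison.
results, teams = team_draft_interleave(["a", "b", "c"], ["c", "d", "a"], k=4)
```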

I have tried plenty of approaches for maintaining 3D sparse matrices, but none have worked out well so far.

TensorFlow's SparseTensor doesn't yet support the operations we need; I could have used it for basic 3D sparse storage, but it doesn't support slicing yet either.

Interesting that they have a SparseTensor but the operations we need aren't supported.

There is an existing library with 3D COO matrix support, based on Cython only (I think), but its slicing is not fast enough to beat scipy's.

Results showed it to be quite slow compared to the current methods.

Also, TensorFlow doesn't support the sparse operations we need, so we would have to convert to dense anyway.

I am thinking of operating directly over the indices and data, building a dense batch array directly from the data.

This sounds reasonable to me, and is mostly what is happening inside of scipy.sparse. Since we already have to hack around a bit to make scipy.sparse work on 3D data, this seems like a sensible approach.
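(A rough sketch of that idea; the (observation, timestep, feature) index layout is an assumption about how the data could be stored, not the actual file format.)

```python
import numpy as np

def dense_batch_from_coo(indices, values, batch_rows, obs_shape):
    """Build a dense batch directly from COO-style indices and values.

    indices:    (nnz, 3) array of (observation, timestep, feature) coordinates
    values:     (nnz,) array of the corresponding non-zero values
    batch_rows: which observation ids go into this batch
    obs_shape:  (n_timesteps, n_features) of a single observation
    """
    batch = np.zeros((len(batch_rows),) + tuple(obs_shape), dtype=np.float32)
    pos = {row: i for i, row in enumerate(batch_rows)}
    mask = np.isin(indices[:, 0], batch_rows)
    for (obs, t, f), v in zip(indices[mask], values[mask]):
        batch[pos[obs], t, f] = v
    return batch

# Example: two non-zeros spread over a batch of two observations.
idx = np.array([[0, 1, 2], [5, 0, 3]])
val = np.array([1.0, 2.5])
print(dense_batch_from_coo(idx, val, batch_rows=[0, 5], obs_shape=(2, 4)))
```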

Also, @EBernhardson I have set up a normalization function for input dynamicity over the length of the time period. As you know, the current data covers a 90-day period; can you regenerate data over different time periods so that I can test its performance and make the required changes?

Sure, I'll pull some data for 30 and 60 day aggregations and we can compare the datasets.

Interesting that they have a SparseTensor but the operations we need aren't supported.

Yep, it has support just like normal sparse matrices, but converting it to dense batches is very costly compared to scipy.sparse (at least when computing on CPU).

Sure, I'll pull some data for 30 and 60 day aggregations and we can compare the datasets.

Yep, I will look into these once they are uploaded.

Congratulations @Kdhingra2210 on getting selected for this year's Google Summer of Code. It was a pleasure interacting with you, @EBernhardson, @TJones, and @srishakatux. I know I fell short of delivering the expected goals during the pre-GSoC period. Any advice from the mentors, admins, or @Kdhingra2210 is welcome so that I can apply with greater zeal next year.
My warm wishes to all of you for this year's Summer of Code. :)

Congratulations @Kdhingra2210 on getting selected for this year's Google Summer of Code.

Thank you very much!

@EBernhardson and @TJones during the past couple of weeks I have mainly worked on building different types of sparse matrices and on understanding the research paper, because there are certain small details which can have a large effect.

I have merged the TensorFlow code into the main model, just to start from somewhere.

You can run it using

mjolnir ncm -v

It should start training the model if everything works perfectly.

I am currently integrating 3D support into scipy; I should finish it by tomorrow and will run different tests on it after that.

Plus, I have developed a separate script to test normalization of the matrices; I was thinking of merging it after testing the results against the data for different time durations.

@EBernhardson I did implement 3D sparse matrices in scipy too,

but so far the results aren't in our favor.

The scipy implementation renders the new matrices much faster than the tensor and other pure-Python implementations, but the results still proved to be slower than the original (reshape) approach.

I am going through the patch I implemented once again today to see whether I made some mistake in it.

Sounds good! Sorry for not being as responsive lately. If you ever need more real-time help I'm generally available M-F 1600-2400 UTC on freenode irc in #wikimedia-discovery. @TJones is similarly available but shifted 2 or 3 hours earlier.

I would have to double check how the reshape approach works under the hood, but it's entirely possible reshape doesn't change the data at all and merely changes how the indexing calculations into the data happen. If there is no copy, or even visiting of all the data, to reshape then it makes sense that might be the fastest method.
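(That is indeed how numpy behaves when the memory layout allows it: reshape on a contiguous array returns a view that shares the original buffer, so no data is copied or visited. Whether the scipy-based 2d_reshape path here gets the same benefit would still need checking.)

```python
import numpy as np

a = np.arange(12)              # one contiguous buffer of 12 values
b = a.reshape(3, 4)            # no copy: only shape/stride metadata changes

print(b.base is a)             # True - b is a view onto a's data
print(np.shares_memory(a, b))  # True

b[0, 0] = 99                   # writing through the view changes the original
print(a[0])                    # 99
```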

@EBernhardson you were right, I made a mistake while defining an object in the Cython code.

Now it is more than twice as fast as the old one.

I have only run one test so far: 3d took 9 seconds while 2d_reshape took 23 seconds.
(I will share more extensive results asap.)

I will be more active on freenode during those hours.

Now it is more than twice as fast as the old one.

Cool!

My usual hours are M-F 1300-2100 UTC, though I can usually be available earlier or later with a little planning ahead.

@TJones and @EBernhardson

I was working on hyperparameter tuning with the new sparse matrices last week, and I ran into some doubts about how the data has been formatted.

When can we discuss it?

I'm not sure what hours work best for you, but generally 1500-2000 UTC is when I have meetings (to at least attempt to overlap with Europe). We could do IRC if that works (just drop by whenever and say hello), or we can set up a Hangouts call to talk about it as well. What works best for you? We could do Hangouts, maybe Wednesday at 1500 UTC?

ping @EBernhardson @Kdhingra2210 @TJones Is there anything remaining in this task from GSoC'18? If not, then please consider marking it as resolved! If yes, and it would need another GSoC or volunteer's help, then consider creating a new task with the leftover items.

@srishakatux yes, testing is ongoing and I think it should be finished in the next week or so.

@srishakatux yes, testing is ongoing and I think it should be finished in the next week or so.

@Kdhingra2210: Any news to share about the status of this task? Thanks in advance!

Change 409124 abandoned by EBernhardson:
[WIP] Neural Click Model

https://gerrit.wikimedia.org/r/409124

While interesting, we didn't end up moving forward with this; there were just too many moving pieces and too much future maintenance needed for a production deployment.