
Setup test/train/validation splits for our training data.
Closed, Resolved (Public)

Description

There are several ways this could work, and I'm not yet sure which is best. For reference, we will have labeled data in which all rows sharing a (normalized query, page_id) combination have the same label. Note, though, that feature generation will not be done against the normalized query; it will be done against the original query, prior to normalization. This is because, while two searches may look the same once normalized, a search for "valerian and the city of" may generate slightly different features than "valerian city".

Some things to keep in mind:

  • Data has generally been de-duplicated, but only by the original queries as provided by users.
  • Each query will have 10, 20, or more labeled data points. It seems plausible that, when splitting the data, all information for a particular query should end up in the same bucket.
  • Does the above also apply to normalized versions of the queries? The DBN labels will be the same, so I think yes.

Potential problems:

  • Do we split based on the # of labels, # of unique queries, or # of normalized unique queries?
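To make the question concrete, here is a minimal sketch that buckets rows by a hash of the normalized query and then reports each bucket's size under all three counting schemes. The function names, the row layout and the 80/10/10 fractions are assumptions for illustration, not something decided in this task.

```python
import hashlib
from collections import defaultdict


def bucket_for(norm_query, train=0.8, test=0.1):
    """Deterministically map a normalized query to a split name.

    The 80/10/10 default is an assumed ratio, not a decided one.
    """
    digest = hashlib.md5(norm_query.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # pseudo-uniform value in [0, 1)
    if u < train:
        return "train"
    if u < train + test:
        return "test"
    return "validation"


def split_summary(rows):
    """rows: iterable of (query, norm_query, page_id, label) tuples.

    Returns bucket -> (# labels, # unique queries, # unique normalized queries).
    """
    labels = defaultdict(int)
    queries = defaultdict(set)
    norm_queries = defaultdict(set)
    for query, norm_query, _page_id, _label in rows:
        bucket = bucket_for(norm_query)
        labels[bucket] += 1
        queries[bucket].add(query)
        norm_queries[bucket].add(norm_query)
    return {b: (labels[b], len(queries[b]), len(norm_queries[b])) for b in labels}


if __name__ == "__main__":
    # Hypothetical rows; page ids and labels are made up.
    rows = [
        ("Class TV", "class tv", 1, 0.9),
        ("class (tv)", "class tv", 1, 0.9),
        ("2016 MLB postseason", "2016 mlb postseason", 2, 0.7),
    ]
    print(split_summary(rows))
```

Because the bucket depends only on the normalized query, every original query that normalizes to the same thing lands together. The three counts can still diverge a lot, since one normalized query may carry many original queries and many labels.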

Some examples of queries that are different but all normalize to the same thing:

2016 MLB POSTSEASON
2016 MLB Postseason
2016 MLB postseason
2016 mlb postseason
2016 mlb  postseason
2016 MLB POStseason
CLASS TV
Class (TV)
Class TV
Class tv
class (tv)
class tv
class (tv
class TV
class  tv
Class  tv
class (TV)
Class Tv
class,tv
the class tv
cLASS tv
a nightmare in elm street
a nightmare on elm street
nightmare at the elm street
nightmare elm street
nightmare in the elm street
nightmare on elm street
Nightmare on Elm Street, A
Nightmare on elm's Street
a nightmare elm street
nightmare at elm street
nightmare of elm street
the nightmare at the elm street
/nightmare-on-elm-street
The nightmare in the Elm street
The nightmare on Elm Street
nightmare on elm streets
nightmare on the elm street
nightmare-on-elm-street-\
the nightmare on the elm street
a nightmare at elm street
nightmare in elm street
A nightmare of elm street
a nightmare elm  street
the nightmare of the elm street
Nightmare elm street
Nightmares on Elm Street
nightmare on elm's street
Nightmare at elm street
Nightmare in Elm Street
nightmare in the Elm street
a nightmare on elms street
the nightmare at elm street
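The task does not say how the normalization itself is done. Purely as an illustration of why these variants collapse, the toy normalizer below lowercases, strips punctuation, drops a small assumed stopword list and trims a plural "s". `toy_normalize` and `STOPWORDS` are hypothetical names, and the real normalization may well differ.

```python
import re

# Assumed stopword list, chosen only to cover the examples above.
STOPWORDS = {"a", "the", "in", "on", "at", "of"}


def toy_normalize(query):
    """Collapse case, punctuation, stopwords and simple plurals (toy rules)."""
    text = query.lower().replace("'", "")
    kept = []
    for token in re.split(r"[^a-z0-9]+", text):
        if not token or token in STOPWORDS:
            continue
        # Crude plural stripping: "streets" -> "street", but keep "class".
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]
        kept.append(token)
    return " ".join(kept)


if __name__ == "__main__":
    for q in ["Class (TV)", "the class tv", "Nightmare on Elm Street, A"]:
        print(repr(q), "->", repr(toy_normalize(q)))
```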

Event Timeline

Also, we may or may not need validation data; it seems to depend on whether the library wants validation data separate from the test data. The general concept is that LambdaRANK generates an ensemble of trees: each tree is trained using the train set, then the ensemble of trees is tested against the test set. Once the score stops improving as more trees are generated (or we hit the configured limit on the # of trees), training stops and final scores are generated against the validation set.
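The stopping rule described above can be sketched roughly as follows. This is not any particular library's API: `add_tree` and `score` are hypothetical callables standing in for "fit one more tree on the train set" and "score the ensemble on the test set", and the `patience` parameter is an assumption about how "stops improving" might be decided.

```python
def train_with_early_stopping(add_tree, score, max_trees=100, patience=5):
    """Grow an ensemble until the held-out score stops improving.

    add_tree(ensemble) -> a new tree fitted on the train set (hypothetical)
    score(ensemble)    -> float, higher is better, computed on the test set
    Returns the ensemble truncated to its best size, and that best score.
    """
    ensemble = []
    best_score = float("-inf")
    best_size = 0
    for _ in range(max_trees):
        ensemble.append(add_tree(ensemble))
        current = score(ensemble)
        if current > best_score:
            best_score, best_size = current, len(ensemble)
        elif len(ensemble) - best_size >= patience:
            break  # no improvement for `patience` consecutive trees
    return ensemble[:best_size], best_score


if __name__ == "__main__":
    # Made-up score curve standing in for a real ranking metric.
    curve = [0.50, 0.60, 0.65, 0.64, 0.65, 0.63, 0.62, 0.61]
    model, best = train_with_early_stopping(
        add_tree=lambda ensemble: "tree-%d" % len(ensemble),
        score=lambda ensemble: curve[len(ensemble) - 1],
        max_trees=len(curve),
        patience=3,
    )
    print(len(model), best)
```

The final number to report would then come from scoring the returned ensemble on the validation set, which plays no part in deciding when to stop.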

Another difficulty with the above is that Spark doesn't appear to offer any kind of grouped splitting of data. We could perhaps crib an implementation from scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/model_selection/_split.py#L423
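As a rough idea of what a cribbed implementation might look like, the sketch below keeps whole groups together and greedily hands the largest groups out first, each to the bucket furthest below its target share. It is loosely modeled on the greedy balancing that scikit-learn's `GroupKFold` uses, adapted to unequal fractions. `grouped_split` and its arguments are hypothetical names, and it assumes the group sizes fit in memory on one machine.

```python
from collections import Counter


def grouped_split(groups, fractions):
    """Assign whole groups to buckets, approximating the requested row shares.

    groups:    one group key per row (e.g. a normalized query id)
    fractions: dict of bucket name -> desired share of rows (each > 0)
    Returns a dict of group key -> bucket name.
    """
    sizes = Counter(groups)
    total = sum(sizes.values())
    filled = {name: 0 for name in fractions}
    assignment = {}
    # Largest groups first; ties broken by key so the result is deterministic.
    ordered = sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, size in ordered:
        # Pick the bucket that is currently furthest below its target.
        name = min(filled, key=lambda n: filled[n] / (fractions[n] * total))
        assignment[key] = name
        filled[name] += size
    return assignment


if __name__ == "__main__":
    groups = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
    fractions = {"train": 0.6, "test": 0.2, "validation": 0.2}
    print(grouped_split(groups, fractions))
```

With only a handful of groups the realized shares can be far from the targets; the balance should improve as the number of groups grows relative to the largest group.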

Certainly things are different when working across partitioned datasets, but perhaps we can partition the data by normalized query id and run the simple algorithm on each partition, hoping the final result is roughly balanced.
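A pure-Python stand-in for that idea, with no Spark involved: hash each normalized query id to a partition, run a simple greedy grouped split independently inside each partition, then merge the results. All names here are hypothetical, the partition count is arbitrary, and whether the merged result is balanced enough in practice is exactly the open question.

```python
import hashlib
from collections import Counter, defaultdict


def partition_of(key, num_partitions):
    """Stable hash partitioner, standing in for Spark's partitioning."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def split_partition(sizes, fractions):
    """Greedy grouped split over one partition's group sizes."""
    total = sum(sizes.values())
    filled = {name: 0 for name in fractions}
    assignment = {}
    for key, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        name = min(filled, key=lambda n: filled[n] / (fractions[n] * total))
        assignment[key] = name
        filled[name] += size
    return assignment


def partitioned_split(groups, fractions, num_partitions=4):
    """Split each partition independently, then merge the assignments."""
    per_partition = defaultdict(Counter)
    for key in groups:
        per_partition[partition_of(key, num_partitions)][key] += 1
    assignment = {}
    for sizes in per_partition.values():
        assignment.update(split_partition(sizes, fractions))
    return assignment


def realized_shares(groups, assignment):
    """Share of rows that actually ended up in each bucket."""
    counts = Counter(assignment[key] for key in groups)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    groups = ["q%d" % (i % 37) for i in range(500)]
    fractions = {"train": 0.8, "test": 0.1, "validation": 0.1}
    print(realized_shares(groups, partitioned_split(groups, fractions)))
```

Since a given normalized query id always hashes to the same partition, no group is ever divided across buckets. The cost is that each partition balances only against its own totals, so small or skewed partitions push the overall shares away from the targets.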