
Multivariate logistic regression on search scores
Closed, Resolved (Public)

Description

Unlike T271799, which fits an individual regression for each component, in this task we’ll use all 15 components (title, caption, etc.) as input and fit a single logistic regression with these scores as features and the ratings as the target variable.

There seem to be a lot of sparse features, so we’ll check how to handle sparse features in logistic regression.

The output will be the model's performance on several metrics (F1 score, balanced accuracy, etc.) and the coefficients for the individual components.

Event Timeline

@Cparle we are thinking of normalizing the component scores before fitting the regression. Is it possible to get the max and min value (score) that each component can take? This would help us build a normalization that generalizes beyond the data you shared. Many thanks!

Min is zero, but there's actually no maximum - sorry! This is one of the things that makes elasticsearch difficult - scoring is on an open-ended scale

No prob, thanks @Cormac - then we probably should avoid normalization in this case, @Aiko?

@Miriam Yeah, if there is no maximum, it's not appropriate to use normalization. I'll post the results for the non-normalized version.
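To illustrate why an open-ended scale is a problem for min-max normalization, here is a minimal sketch. The scores are made up and `min_max_scale` is a hypothetical helper; neither comes from the shared data.

```python
def min_max_scale(value, lo, hi):
    """Map value from the range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

# Hypothetical scores for one component in the labelled sample.
sample_scores = [0.0, 3.2, 7.5, 12.0]

# The minimum is known to be 0, but the only available "maximum"
# is the largest score that happens to appear in the sample.
lo, hi = 0.0, max(sample_scores)

scaled_sample = [min_max_scale(s, lo, hi) for s in sample_scores]

# A later query can score higher than anything in the sample,
# so its scaled value lands outside [0, 1].
unseen_score = 30.0
scaled_unseen = min_max_scale(unseen_score, lo, hi)
print(scaled_sample, scaled_unseen)
```

Within the sample everything falls in [0, 1], but the unseen score does not, so a scaler fitted this way would not generalize beyond the data it was fitted on.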

Hi all,

Here are the results of the logistic regression on search scores, along with some findings:

The rating in the raw data takes three values, {1, 0, -1}. To turn this into a binary classification problem, one option is to treat rating=0 as a bad match, so I merged the rating=0 data into the rating=-1 data.

Another option is to remove the rating=0 data, which makes up 0.15 of the whole dataset, and fit the logistic regression on the remaining data (rating=1 and rating=-1).

Overall, the second version performs better than the first. I report both results below:
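The two ways of turning the three-valued rating into a binary target can be sketched as follows. The rows and the helper names `merge_neutral` and `drop_neutral` are hypothetical, and the real rows would carry 15 component scores rather than 2.

```python
# Hypothetical rows: (component scores, rating).
rows = [
    ([0.0, 4.1], 1),
    ([2.3, 0.0], 0),
    ([0.0, 0.0], -1),
    ([5.6, 1.2], 1),
    ([0.7, 0.0], 0),
]

def merge_neutral(rows):
    """Version 1: rating 0 counts as a bad match, so only rating 1 is positive."""
    return [(x, 1 if rating == 1 else 0) for x, rating in rows]

def drop_neutral(rows):
    """Version 2: discard rating 0, then map 1 -> 1 and -1 -> 0."""
    return [(x, 1 if rating == 1 else 0) for x, rating in rows if rating != 0]

merged = merge_neutral(rows)    # keeps every row
dropped = drop_neutral(rows)    # keeps only rows rated 1 or -1
print(len(merged), len(dropped))
```

The first version keeps all rows but adds ambiguous examples to the negative class; the second trains on fewer, more clearly labelled rows.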

Model performance
  1. rating=0 data treated as a bad match
balanced accuracy: 0.6306
average precision score: 0.6506
brier score loss: 0.2153
f1 score: 0.5318
  2. rating=0 data removed
balanced accuracy: 0.6624
average precision score: 0.7551
brier score loss: 0.2073
f1 score: 0.6293
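For readers unfamiliar with these metrics, the sketch below spells out three of them in plain Python on made-up labels and probabilities. The metric names suggest scikit-learn's `balanced_accuracy_score`, `f1_score`, `brier_score_loss` and `average_precision_score`, but that is an assumption; the 0.5 threshold is also an assumption, and average precision is left out to keep the example short.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and on the negative class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 2 * tp / (2 * tp + fp + fn)

def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and label."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

# Hypothetical labels and predicted probabilities of a good match.
y_true = [1, 0, 1, 1, 0, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

print(balanced_accuracy(y_true, y_pred), f1(y_true, y_pred), brier(y_true, y_prob))
```

Higher is better for balanced accuracy and F1, while lower is better for the Brier score, which is why the second version improves on all of the reported numbers.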
Coefficients

The two versions above show the same trend, and in both the 15th feature has the highest coefficient.

  1. rating=0 data treated as a bad match

Figure-

Coefficient-

array([-0.00722761,  0.021936  ,  0.04056063,  0.03559395,  0.02052407,
        0.02052407,  0.01702212,  0.00057635, -0.01129512,  0.        ,
       -0.01127191,  0.0138696 , -0.03131791,  0.02770065,  0.09263194])

Intercept-

-1.3638519776835112
  2. rating=0 data removed

Figure-

Coefficient-

array([-0.02159655,  0.04977869,  0.05276279,  0.04668993,  0.02615154,
        0.02615154,  0.00636762,  0.01282327, -0.02016103,  0.        ,
       -0.02192281,  0.02003289, -0.04702737,  0.03902806,  0.10792615])

Intercept-

-1.172818154019393
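
As a rough illustration of how these numbers turn component scores into a match probability, the sketch below applies the coefficients and intercept reported for the version with rating=0 removed. The function name `predict_proba` and the example score vectors are hypothetical, and the mapping from feature position to component name is not given in this thread.

```python
import math

# Coefficients and intercept as reported above (rating=0 data removed).
COEF = [-0.02159655, 0.04977869, 0.05276279, 0.04668993, 0.02615154,
        0.02615154, 0.00636762, 0.01282327, -0.02016103, 0.0,
        -0.02192281, 0.02003289, -0.04702737, 0.03902806, 0.10792615]
INTERCEPT = -1.172818154019393

def predict_proba(scores):
    """Probability of a good match: sigmoid(intercept + coef . scores)."""
    if len(scores) != len(COEF):
        raise ValueError("expected one score per component")
    z = INTERCEPT + sum(w * s for w, s in zip(COEF, scores))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical inputs: no component matches at all, and a single
# score of 10 on the 15th component only.
baseline = predict_proba([0.0] * 15)
boosted = predict_proba([0.0] * 14 + [10.0])
print(baseline, boosted)
```

With all scores at zero the probability is determined by the intercept alone, and a positive coefficient such as the 15th raises it as that component's score grows.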

Hi @AikoChou - it turns out that elasticsearch does not support negative weights, so we'll need to rework this so that all coefficients are positive (or zero). Could you maybe share your notebook so that I can copy what you've done and play around with it?
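
One possible way to keep every coefficient at zero or above is projected gradient descent: take an ordinary gradient step on the logistic loss, then clip any negative coefficient back to zero. The sketch below is only an illustration of that idea on made-up data, not what the notebook does; `fit_nonnegative_logreg`, the learning rate and the epoch count are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_nonnegative_logreg(X, y, lr=0.1, epochs=2000):
    """Logistic regression whose coefficients are constrained to be >= 0.

    The intercept is left unconstrained; only the feature weights are clipped.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Gradient of the log loss w.r.t. the linear score is (p - y).
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        # Gradient step followed by projection onto the set w >= 0.
        w = [max(0.0, wj - lr * gj / n) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical data: the first feature goes with good matches,
# the second with bad ones (so it would get a negative weight if unconstrained).
X = [[2.0, 0.0], [1.5, 0.5], [0.5, 2.0], [0.0, 1.5], [1.8, 0.2], [0.2, 1.8]]
y = [1, 1, 0, 0, 1, 0]

w, b = fit_nonnegative_logreg(X, y)
print(w, b)
```

A bounded optimizer such as SciPy's `minimize` with `method="L-BFGS-B"` and per-coefficient bounds would be another option for the same constraint.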

Hi @Cparle - yes of course, there you go:

Let me know if you need anything else :)