
Implement ~100 most important hash vector features in editquality models
Open, Low, Public

Description

This task is done when a revscoring scorer model that includes ~100 hashed gram features has been trained and cross-validated.

108 features were found to be most effective in T128087.

Event Timeline

So, I've been thinking that we might want to discover our high-utility hash vector using a larger analysis of reverted edits, and then use that vector as the feature set when training the damaging/goodfaith models.

In T128087, we used the highest-"importance" hashes as learned by a GradientBoosting model. We could stick with that strategy or try out a TF-IDF weight-selection strategy.
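
To make the TF-IDF option concrete, here is a minimal sketch of one way it could work. It is not revscoring code and not what T128087 did. It assumes word bigrams, a hypothetical hash space of 1024 buckets, CRC32 as the hash, and a smoothed IDF; the names `hashed_grams` and `top_buckets_by_tfidf` are made up for illustration. Buckets are ranked by how often they occur in reverted edits, discounted by how many edits overall contain them.

```python
import math
import zlib
from collections import Counter

N_BUCKETS = 2 ** 10  # hypothetical hash-space size, chosen for illustration


def hashed_grams(text, n=2, n_buckets=N_BUCKETS):
    """Count word n-grams in `text` after hashing each one to a bucket."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    # zlib.crc32 is stable across runs, unlike the built-in hash() on str.
    return Counter(zlib.crc32(g.encode("utf-8")) % n_buckets for g in grams)


def top_buckets_by_tfidf(reverted, background, k=100, n=2, n_buckets=N_BUCKETS):
    """Return the k buckets with the highest TF-IDF weight.

    TF is the bucket count summed over the reverted edits; IDF is computed
    over reverted and background edits together (smoothed, assumed form).
    """
    reverted_vecs = [hashed_grams(d, n, n_buckets) for d in reverted]
    all_vecs = reverted_vecs + [hashed_grams(d, n, n_buckets) for d in background]

    tf = Counter()
    for vec in reverted_vecs:
        tf.update(vec)

    df = Counter()  # number of edits in which each bucket appears
    for vec in all_vecs:
        df.update(vec.keys())

    n_docs = len(all_vecs)
    weights = {
        b: tf[b] * (math.log((1 + n_docs) / (1 + df[b])) + 1)
        for b in tf
    }
    # Sort by descending weight; break ties by bucket id for determinism.
    return sorted(weights, key=lambda b: (-weights[b], b))[:k]


# Toy usage with made-up edit text; real input would be revision diffs.
reverted_edits = ["you are stupid", "you are stupid idiot"]
good_edits = ["added a citation", "fixed a typo"]
selected = top_buckets_by_tfidf(reverted_edits, good_edits, k=2)
```

The selected bucket ids would then define the fixed hash vector used as features for the damaging/goodfaith models, with `k` set to roughly 100.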