This task is done when a revscoring scorer model that includes about 100 hashed gram features has been trained and cross-validated.
108 features were found to be most effective in T128087.
Status | Subtype | Assigned | Task
---|---|---|---
Open | | None | T145812 Implement ~100 most important hash vector features in editquality models
Resolved | Spike | Sabya | T128087 [Spike] Investigate HashingVectorizer
So, I've been thinking that we might want to discover our high-utility hash vector features using a larger analysis of reverted edits, and then use those features when training the damaging/goodfaith models.
In T128087, we used the hashes with the highest "importance" as learned by a GradientBoosting model. We could stick with that strategy or try out a TF-IDF weight-selection strategy.
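To make the TF-IDF option concrete, here is a rough sketch of what that selection could look like. It is standard-library Python only and does not use the revscoring or scikit-learn APIs. The function names, the `crc32` hash, the bucket count, and the whitespace tokenizer are all assumptions made for illustration, not what T128087 did.

```python
import math
import zlib
from collections import Counter


def hashed_grams(text, max_n=2, n_buckets=2 ** 18):
    """Count word 1..max_n-grams, keyed by a stable hash bucket."""
    tokens = text.lower().split()  # assumed tokenizer: plain whitespace split
    buckets = Counter()
    for size in range(1, max_n + 1):
        for i in range(len(tokens) - size + 1):
            gram = " ".join(tokens[i:i + size])
            # crc32 is deterministic across runs, unlike the built-in hash()
            buckets[zlib.crc32(gram.encode("utf-8")) % n_buckets] += 1
    return buckets


def top_buckets_by_tfidf(reverted, kept, k=100):
    """Rank hash buckets by their summed TF-IDF weight over reverted edits."""
    docs = [hashed_grams(text) for text in reverted + kept]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    scores = Counter()
    for doc in docs[:len(reverted)]:  # only reverted edits contribute weight
        for bucket, tf in doc.items():
            scores[bucket] += tf * math.log(len(docs) / doc_freq[bucket])
    return [bucket for bucket, _ in scores.most_common(k)]


def vectorize(text, selected):
    """Turn one edit's text into a fixed-length vector over selected buckets."""
    counts = hashed_grams(text)
    return [counts[bucket] for bucket in selected]
```

The idea would be to run `top_buckets_by_tfidf` once over a large sample of reverted and non-reverted edits, then use `vectorize` with the resulting bucket list as the extra features when training the damaging/goodfaith models. Whether this beats the GradientBoosting importance ranking would still need to be checked with cross-validation.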