Incorporate new HTML features into quality model
Open, Needs Triage (Public)

Description

Once we have switched from wikitext-based features to HTML-based features, we can also start thinking about which additional features to incorporate into the model! Three are easy candidates:

  • Number of unique sources
  • Presence of infobox
  • Presence of messagebox
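
As a rough illustration of what these three features could look like in code, here is a minimal sketch using only Python's standard-library HTML parser. It is not the notebook's implementation. The class names `infobox` and `ambox`, the `cite_note` id prefix for reference list items, and the names `QualityFeatureParser` and `extract_features` are all assumptions made for this example; real wikis vary in their template markup, so the selectors would need checking per language.

```python
from html.parser import HTMLParser


class QualityFeatureParser(HTMLParser):
    """Collects three illustrative features from article HTML.

    ASSUMPTIONS (not confirmed by the task): infoboxes carry the CSS class
    "infobox", message boxes carry "ambox", and each unique source is an
    <li> whose id starts with "cite_note".
    """

    def __init__(self, infobox_class="infobox", messagebox_class="ambox"):
        super().__init__()
        self.infobox_class = infobox_class
        self.messagebox_class = messagebox_class
        self.source_ids = set()
        self.has_infobox = False
        self.has_messagebox = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Attribute values can be None (valueless attributes), hence "or".
        classes = (attr_map.get("class") or "").split()
        if self.infobox_class in classes:
            self.has_infobox = True
        if self.messagebox_class in classes:
            self.has_messagebox = True
        element_id = attr_map.get("id") or ""
        if tag == "li" and element_id.startswith("cite_note"):
            # A set keeps each reference-list entry once, however often it is cited.
            self.source_ids.add(element_id)


def extract_features(html):
    """Return the three candidate features for one article's HTML."""
    parser = QualityFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "num_unique_sources": len(parser.source_ids),
        "has_infobox": parser.has_infobox,
        "has_messagebox": parser.has_messagebox,
    }


# Hand-written sample, not real article output.
sample_html = """
<table class="infobox vcard"><tr><td>Born</td></tr></table>
<p>Text<sup class="reference"><a href="#cite_note-a-1">[1]</a></sup>
again<sup class="reference"><a href="#cite_note-a-1">[1]</a></sup>
and<sup class="reference"><a href="#cite_note-2">[2]</a></sup></p>
<ol class="references">
<li id="cite_note-a-1">Source A</li>
<li id="cite_note-2">Source B</li>
</ol>
"""

features = extract_features(sample_html)
print(features)
```

In the sample, source A is cited twice inline but appears once in the reference list, so it counts as one unique source. Counting list items rather than inline markers is one way to separate "unique sources" from "references".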

Others?

Event Timeline

Thanks @DJames-WMF for your first pass (notebook)! Next steps now that we have some initial findings:

  • You're extracting message-boxes/infoboxes correctly but you'll update the code to make sure they're retained in the final model as well.
  • You can trim out some of the old summary at the top of the notebook to focus on just this expanded model and its coefficients (with a link back to Copy-2 where folks can find more details about the earlier iterations).
  • What to do about this mysterious negative sources coefficient?
    • I've been wondering how consistent the different languages are. One way to test this would be to train a separate model for each language and see how similar the coefficients are. This might help identify features that are less stable and worth investigating, and it could give us some insight into the sources feature or maybe even some of the others. I expect the magnitudes to vary a bit, but if any coefficient switches from positive to negative (or vice versa), that would be the interesting sign to pay attention to.
    • In these sorts of models, each feature coefficient is the impact of that feature when all other features are held constant. Because of this dependence on other features, the coefficient for a feature like sources doesn't just depend on its relationship to quality but also on its relationship to all the other features. I think this is what is going on. This (quite long) tutorial on linear model coefficients/interpretation/fine-tuning has some useful examples of this and ways to chart things out. In particular, I would love to see similar charts for our data/model to:
      • The pairplots in this section, which will help us see which coefficients are correlated with each other.
      • The coefficient variability plot in this section, which will help us see which model coefficients are unstable.
      • If it turns out that sources is highly correlated with the other features, we might have to take it out. It might also be possible to switch it to a simpler boolean (e.g., >= 5 sources), and that might help reduce the correlation while still retaining most of the benefits of counting unique sources separately from references.
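
To make the two ideas above concrete, here is a small sketch on made-up numbers (not our data, and not the notebook's model). The first part shows how a feature that is positively related to quality on its own can still get a negative coefficient once a correlated feature is in the model. The second part is a helper that flags features whose coefficient sign differs across per-language models. All names (`simple_slope`, `two_feature_ols`, `sign_flips`) and all values are hypothetical, and the toy quality score is constructed by hand so the effect is easy to see.

```python
def _centered(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simple_slope(x, y):
    """Slope of y regressed on x alone (no other features in the model)."""
    dx, dy = _centered(x), _centered(y)
    return _dot(dx, dy) / _dot(dx, dx)


def two_feature_ols(x1, x2, y):
    """OLS slopes for y ~ intercept + x1 + x2 via the 2x2 normal equations."""
    d1, d2, dy = _centered(x1), _centered(x2), _centered(y)
    s11, s22, s12 = _dot(d1, d1), _dot(d2, d2), _dot(d1, d2)
    s1y, s2y = _dot(d1, dy), _dot(d2, dy)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("features are perfectly collinear")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2


def sign_flips(coefs_by_language):
    """Features that are positive in some languages and negative in others."""
    names = sorted({f for coefs in coefs_by_language.values() for f in coefs})
    flipped = []
    for name in names:
        values = [c[name] for c in coefs_by_language.values() if name in c]
        if any(v > 0 for v in values) and any(v < 0 for v in values):
            flipped.append(name)
    return flipped


# Synthetic toy data: sources rises with references (strong correlation).
references = [2, 4, 6, 8, 10, 12]
sources = [1, 3, 3, 5, 5, 7]
# Hand-built quality score, chosen purely for illustration.
quality = [r - 0.5 * s for r, s in zip(references, sources)]

alone = simple_slope(sources, quality)
b_refs, b_sources = two_feature_ols(references, sources, quality)
print("sources alone:", alone)
print("with references held constant:", b_refs, b_sources)

# Made-up per-language coefficients, only to exercise the helper.
made_up_coefs = {
    "en": {"references": 0.9, "sources": -0.2},
    "fr": {"references": 0.8, "sources": 0.1},
}
print("sign flips:", sign_flips(made_up_coefs))
```

On this toy data the slope for sources alone is positive, while its coefficient with references held constant is negative. That pattern is consistent with the correlated-features explanation but does not by itself confirm it for our model; the pairplots and coefficient variability plots would be the real check.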