
Scraper: compute similarity between ref bodies
Closed, Resolved, Public

Description

We want to know how common it is for refs to have literally identical or nearly identical content. Identical content could be collapsed, so it is either a missed opportunity for ref reuse or a deliberate literal copy. Nearly identical content is either a candidate for "extends" reuse or a mistake.

  • Choose a fuzzy text comparison algorithm. Maybe Levenshtein?
  • Extract the rendered content from each ref.
  • Normalize content into text (?)
  • Implement (half of) the Cartesian product, comparing each pair of ref bodies and recording whether the ref names are equal.
  • Categorize each pair as one of:
    • identical and correct reuse
    • identical and missed reuse
    • near and incorrect reuse
    • near and correctly not reused, but a possible "extends" candidate (maybe store the edit distance with this one)
  • Aggregate (TBD)
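
A rough sketch of the steps above, in Python for illustration only (not the code under review). Everything here is an assumption: the `Ref` shape, the `normalize` rules, the hypothetical `NEAR_THRESHOLD` cutoff, and the reading that "same ref name" means the ref was reused. Plain Levenshtein is used because the task suggests it; another fuzzy metric could be swapped in.

```python
import re
from itertools import combinations
from typing import NamedTuple, Optional


class Ref(NamedTuple):
    name: Optional[str]  # ref name attribute, None if unnamed (assumed shape)
    body: str            # rendered content, already extracted from the page


def normalize(text: str) -> str:
    """One possible normalization: drop tags, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


# Hypothetical cutoff: pairs count as "near" when the edit distance is at
# most 20% of the longer normalized body. The task does not specify a value.
NEAR_THRESHOLD = 0.2


def categorize(a: Ref, b: Ref):
    """Return (category, distance), or None when the pair is unrelated."""
    text_a, text_b = normalize(a.body), normalize(b.body)
    # Assumption: two refs with the same non-empty name are a reuse.
    same_name = a.name is not None and a.name == b.name
    distance = levenshtein(text_a, text_b)
    if distance == 0:
        if same_name:
            return ("identical and correct reuse", 0)
        return ("identical and missed reuse", 0)
    if distance / max(len(text_a), len(text_b)) <= NEAR_THRESHOLD:
        if same_name:
            return ("near and incorrect reuse", distance)
        return ("near and not reused (possible extends)", distance)
    return None


def compare_all(refs):
    """Compare each unordered pair once: half the Cartesian product, no self-pairs."""
    results = []
    for a, b in combinations(refs, 2):
        outcome = categorize(a, b)
        if outcome is not None:
            results.append((a, b, *outcome))
    return results
```

The pairwise comparison is quadratic in the number of refs, so it probably only makes sense per page rather than across a whole dump. Aggregation is still TBD; counting results per category (for example with `collections.Counter`) would be one starting point.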

Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/50