We want to know how common it is for refs to have literally identical or nearly identical content. Identical content can be collapsed, so each occurrence is either a missed opportunity for ref reuse or a deliberate literal copy. Near-identical content is either a candidate for "extends" reuse or a mistake.
- Choose a fuzzy text comparison algorithm. Maybe Levenshtein?
- Extract the rendered content from each ref.
- Normalize content into text (?)
- Compare every unordered pair of refs (half of the Cartesian product), recording how similar the ref bodies are and whether the ref names are equal.
- Categorize each pair as one of:
  - identical and correct reuse
  - identical and missed reuse
  - near and incorrect reuse
  - near and correctly not reused but could be an extends (maybe store edit distance metric with this one)
- Aggregate (TBD)
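
The steps above could be sketched roughly as follows. This is only an illustration in Python, not the code under review: the `Ref` type, the category labels, the extra `unrelated` bucket, and the `NEAR_THRESHOLD` cutoff are all assumptions made up for the example, and Levenshtein is used only because it is the candidate named above. Normalization here just strips tags and collapses whitespace, which may or may not be what we settle on.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from itertools import combinations
from typing import Optional

# Assumed cutoff: pairs whose edit distance is at most 20% of the longer
# text count as "near". The real threshold still has to be chosen.
NEAR_THRESHOLD = 0.2


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def normalize(html: str) -> str:
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


@dataclass(frozen=True)
class Ref:
    """Hypothetical ref record: optional name plus rendered HTML body."""
    name: Optional[str]
    html: str


def categorize(a: Ref, b: Ref):
    """Return (category, edit_distance) for one pair of refs."""
    text_a, text_b = normalize(a.html), normalize(b.html)
    same_name = a.name is not None and a.name == b.name
    distance = levenshtein(text_a, text_b)
    if distance == 0:
        label = "identical_correct_reuse" if same_name else "identical_missed_reuse"
        return label, 0
    if distance / max(len(text_a), len(text_b)) <= NEAR_THRESHOLD:
        label = "near_incorrect_reuse" if same_name else "near_extends_candidate"
        return label, distance
    return "unrelated", distance  # bucket not in the task list; assumed


def compare_all(refs):
    """Categorize each unordered pair once (half of the Cartesian product)."""
    return [(a, b, *categorize(a, b)) for a, b in combinations(refs, 2)]


if __name__ == "__main__":
    refs = [
        Ref("smith", "<i>Smith</i> 2020, p. 1"),
        Ref(None, "Smith  2020, p. 1"),
        Ref(None, "Smith 2020, p. 2"),
    ]
    for a, b, label, distance in compare_all(refs):
        print(label, distance)
```

Comparing every pair is quadratic in the number of refs, so this would presumably run per article rather than across the whole dump. The edit distance is returned for every pair, which leaves room to store it for the "could be an extends" category.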
Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/50