Scraper: compute similarity between ref bodies
Closed, Resolved
Public

Assigned To

None

Authored By

	awight
	Apr 21 2023, 12:14 PM

Description

We want to know how common it is for refs to have literally identical or nearly identical content. Identical content could be collapsed, so it is either a missed opportunity for ref reuse or was copied literally on purpose. Nearly identical content is either a candidate for "extends" reuse or a mistake.

Choose a fuzzy text comparison algorithm. Maybe Levenshtein?
Extract the rendered content from each ref.
Normalize content into text (?)
Compare each unordered pair of ref bodies (half of the Cartesian product), and record whether the ref names are equal.
Categorize each pair as one of:
- identical and correct reuse
- identical and missed reuse
- near and incorrect reuse
- near and correctly not reused, but could be an "extends" (maybe store the edit distance metric with this one)
Aggregate (TBD)
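
The steps above could be sketched roughly as follows. This is only an illustration in Python, not the code under review: the tag-stripping `normalize`, the `NEAR_THRESHOLD` cutoff, the extra "unrelated" bucket, and the assumption that an equal, non-empty ref name means "reused" are all hypothetical choices made for the example.

```python
from itertools import combinations
import re

# Hypothetical cutoff: pairs whose edit distance is at most 20% of the
# longer text count as "near". The task does not specify a threshold.
NEAR_THRESHOLD = 0.2


def normalize(html):
    """Crude stand-in for HTML-to-text: drop tags, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def levenshtein(a, b):
    """Edit distance by dynamic programming, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def categorize(ref_a, ref_b):
    """Return (category, edit distance) for two (name, body) refs."""
    name_a, body_a = ref_a
    name_b, body_b = ref_b
    text_a, text_b = normalize(body_a), normalize(body_b)
    # Assumption: refs sharing a non-empty name are treated as reused.
    same_name = name_a is not None and name_a == name_b
    distance = levenshtein(text_a, text_b)
    if distance == 0:
        label = "identical, correct reuse" if same_name else "identical, missed reuse"
        return label, 0
    longest = max(len(text_a), len(text_b))
    if distance / longest <= NEAR_THRESHOLD:
        label = "near, incorrect reuse" if same_name else "near, candidate for extends"
        return label, distance
    # Not one of the task's four categories; added so every pair gets a label.
    return "unrelated", distance


def compare_refs(refs):
    """Compare each unordered pair once (half of the Cartesian product)."""
    return [(a, b, *categorize(a, b)) for a, b in combinations(refs, 2)]
```

Comparing every pair is quadratic in the number of refs, and each Levenshtein call is quadratic in text length, so for a real dump it would probably make sense to compare refs only within one page, or to find identical bodies by hashing first and run the fuzzy comparison on the remainder.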

Code to review:
https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/merge_requests/50

Related Objects

Status	Assigned	Task
Resolved	None	T345411 Scraper: destroy Cloud VPS runner instance
Resolved	None	T341751 Publish dump scraper reports
Resolved	None	T335411 Scraper: produce spreadsheet of scraped statistics for comparing wikis
Resolved	awight	T332032 Create baseline statistics for reference usage (2023)
Resolved	None	T335186 Scraper: compute similarity between ref bodies