We want to understand why similar refs exist (currently defined by us as "differing by fewer than 10 characters in length, and with a Levenshtein edit distance of less than 10% of the length of the longer text"). German Wikipedia editors have also asked for a list of articles in which we've detected such similarities, so they can clean them up. One hypothesis is that these will mostly be small but meaningful differences, for example two different page numbers in the same book.
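The similarity criterion above can be sketched in Python. This is a minimal sketch: `levenshtein` here is a plain dynamic-programming edit distance, not the scraper's actual implementation, and `is_similar` simply transcribes the two thresholds from the definition.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_similar(a: str, b: str) -> bool:
    # "Similar refs": length difference under 10 characters AND
    # edit distance under 10% of the longer ref's length.
    longer = max(len(a), len(b))
    return abs(len(a) - len(b)) < 10 and levenshtein(a, b) < 0.1 * longer
```

Note that the 10% bound makes the check strict for very short refs: two five-character refs differing in one character are *not* similar, because 1 exceeds 0.5.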
We cannot commit to doing this regularly; for now it's a one-off analysis.
* [ ] Select a handful of wikis with varied template-ref usage patterns to process; include dewiki.
* [ ] Go through the scraper output and produce a list of articles with similar refs (e.g. with `jq`'s `select()`).
* Columns: article title, URL to the page, count of similar refs, count of identical refs
    * Example query: `gunzip -c < ffwiki-20240301-page-summary.ndjson.gz | jq -r 'select(.similar_ref_count > 0) | [.title, .similar_ref_count, .identical_ref_count] | @tsv'`
* [ ] Choose a limited sample of these pages to review manually.
* [ ] Hand-code each page from our sample, letting the data define our categories. For example: "two different page numbers in the same book", "accidental typo in otherwise identical refs", "accidental whitespace change", "different issues of a magazine, or another change that would not fit our extends feature"...
* [ ] Package the output in a form that's easy to share with the German Wikipedia community, and publish it.
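The filter and sampling steps above could be sketched in Python as follows. This is only a sketch: the field names come from the example `jq` query, while the file name, the wiki host, the page-URL pattern (title with spaces as underscores), and the sample size are all assumptions.

```python
import gzip
import json
import random

# Hypothetical names; field names follow the example jq query.
DUMP = "ffwiki-20240301-page-summary.ndjson.gz"
SAMPLE_SIZE = 50  # arbitrary size for the manual-review sample

def pages_with_similar_refs(path):
    # Stream the gzipped NDJSON dump and keep pages with at least
    # one pair of similar refs.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            page = json.loads(line)
            if page.get("similar_ref_count", 0) > 0:
                yield page

def report_rows(pages, wiki_host="ff.wikipedia.org"):
    # One row per page: title, URL, similar-ref count, identical-ref count.
    # The URL pattern is an assumption (spaces replaced by underscores).
    for p in pages:
        url = f"https://{wiki_host}/wiki/{p['title'].replace(' ', '_')}"
        yield (p["title"], url, p["similar_ref_count"],
               p.get("identical_ref_count", 0))

def manual_review_sample(rows, k=SAMPLE_SIZE, seed=0):
    # Fixed seed so the sample is reproducible across runs.
    rows = list(rows)
    random.seed(seed)
    return random.sample(rows, min(k, len(rows)))
```

A run over the dump would then be `manual_review_sample(report_rows(pages_with_similar_refs(DUMP)))`, with the resulting rows written out as TSV or a wikitable for sharing.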