We want to understand why similar refs exist. We currently define "similar" as differing by fewer than 10 characters in length and having a Levenshtein edit distance of less than 10% of the longer text. The German Wikipedia users are also curious to have a list of articles they can clean up in which we've detected such similarities. One hypothesis is that most of these will be small but meaningful differences, for example two different page numbers in the same book.
We cannot commit to doing this regularly; for now it is a one-off analysis.
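As an illustration of the "similar refs" definition above, here is a minimal Python sketch. It is not the scraper's actual implementation: the function names `levenshtein` and `is_similar` are hypothetical, and it assumes that refs are compared as plain strings and that identical refs are excluded from the "similar" count (the scraper output counts them separately).

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_similar(ref_a: str, ref_b: str,
               max_length_diff: int = 10,
               max_distance_ratio: float = 0.10) -> bool:
    """Apply the two thresholds from the task description."""
    if ref_a == ref_b:
        return False  # assumption: identical refs are a separate category
    if abs(len(ref_a) - len(ref_b)) >= max_length_diff:
        return False
    longer = max(len(ref_a), len(ref_b))
    return levenshtein(ref_a, ref_b) < max_distance_ratio * longer


# Made-up example: same book, different page numbers.
ref_1 = "Smith, John: A History of Examples. Example Press, 2001, p. 12."
ref_2 = "Smith, John: A History of Examples. Example Press, 2001, p. 47."
print(is_similar(ref_1, ref_2))
```

Both thresholds are strict ("less than"), so a length difference of exactly 10 characters, or a distance of exactly 10% of the longer text, does not count as similar in this sketch.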
* [] Select a handful of wikis with varied template-ref usage patterns to process, including dewiki.
  * dewiki (12% of pages include similar refs)
  * enwiki (12%)
  * jawiki (16%)
  * cawiki (7%)
* [] Go through scraper output and produce a list of all articles with similar refs.
  * Columns: article title, URL to the page, count of similar refs, count of identical refs
  * Example query: `gunzip -c < dewiki-20240501-page-summary.ndjson.gz | jq -c 'select(.similar_ref_count > 0) | [.title, .ref_count, .similar_ref_count, .identical_ref_count]' > ~/dewiki-20240501-similars.ndjson`
* [] Choose a limited sample of these pages which we will look at manually.
  * Initial sample of 10 articles per wiki (40 total) was chosen by using `shuf` on the results of the command above.
* [] Hand-code each page from our sample, letting the data define our categories. For example, "two different page numbers in same book", "accidental typo in otherwise identical refs", "accidental whitespace change", "different issues of a magazine or another change which would not fit our extends feature", and so on.
* [ ] Package the output in a form that's easy to share with the German Wikipedia community and publish. Reflect on how "extends" could be applied in various cases, its strengths and weaknesses.
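The filtering and sampling steps in the checklist above can also be sketched in Python. This is a hedged equivalent of the `jq` query, not the tooling that was actually used: the function names are hypothetical, the field names are taken from the example query, and `random.sample` stands in for `shuf`.

```python
import gzip
import json
import random


def pages_with_similar_refs(path):
    """Yield one row per page that has at least one similar ref."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            page = json.loads(line)
            # Treat a missing or null count as zero, as jq's comparison would.
            if (page.get("similar_ref_count") or 0) > 0:
                yield [
                    page.get("title"),
                    page.get("ref_count"),
                    page.get("similar_ref_count"),
                    page.get("identical_ref_count"),
                ]


def sample_pages(rows, k=10, seed=None):
    """Pick up to k rows uniformly at random, without replacement."""
    rows = list(rows)
    rng = random.Random(seed)  # pass a seed to make the sample repeatable
    return rng.sample(rows, min(k, len(rows)))
```

For example, `sample_pages(pages_with_similar_refs("dewiki-20240501-page-summary.ndjson.gz"))` would return up to 10 rows for manual review. Passing a fixed `seed` lets someone else reproduce the same sample, which plain `shuf` does not do by default.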
== Summary of findings ==
Disclaimer: this was a small sampling of N=40 articles taken from 4 wikis, 10 per wiki. We can't draw any definite conclusions, but there was some consistency in the patterns across wikis.
Summary spreadsheet: https://docs.google.com/spreadsheets/d/1uWD6ojZP6uXGlgYtB6TX02NXKN_KpeD4wfYPObY1IpU/edit
**Is there evidence for the "book referencing" use case?**
- Yes, in half of the sampled articles there was at least one pair of references which refer to the same book, differing only by a page or volume number.
- Only one or two sampled articles on each wiki leaned heavily on book references; in most cases these were a small minority of the refs.
**Were there any challenges in how "extends" would apply to these book referencing cases?**
- When the book is already mentioned in a Bibliography section, it wouldn't make sense to pull the book out and turn it into a list-defined reference. One possibility is to create a parent reference which points to the bibliography book:
{F56161899}
- Nesting some references but not others breaks the visual equivalence between sources, even though whether one or two references come from a source is not a distinction that deserves such strong semantics.
- Multi-column layout and very long (e.g. 40 entries) lists of indented subreferences both turn the layout into a narrow vertical strip.
- In another extreme example, almost all references in the article are from two encyclopedic sources, which could make it harder to see the "parent" context:
{F56162267}
**What were the other use cases which were detected as "similar" references?**
- The case of "different URL" was roughly twice as prevalent as book references. This is a use case in which the source could be grouped, but each subreference points to a different Internet resource, so the links would have to be inside each subref. For example, there might be different years of a census or different episodes of a TV show. The title of the entire resource often changes for every subreference:
{F56162026}
This is an awkward fit for "extends" because the grouped resource might not have a natural name, and the subreferences still require separate metadata such as access date:
{F56162115}
Another example:
{F56162143}
- Sometimes a single author and outlet, e.g. a journalist and their newspaper, resemble a single "source" of the kind we want to consolidate, but the articles don't fit well into a single group since they're usually considered independent units:
{F56162222}