
Compile a list of articles with "similar" references and hand-categorize use cases
Open, Needs Triage, Public

Assigned To: None
Authored By: awight, Tue, Jun 11, 1:23 PM

Description

We want to understand why similar refs exist. We currently define two refs as "similar" when they differ by less than 10 characters in length and have a Levenshtein edit distance of less than 10% of the longer text. German Wikipedia users would also like a list of articles in which we've detected such similarities, so that they can clean them up. One hypothesis is that these will mostly be small but meaningful differences, for example two different page numbers in the same book.
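The definition above can be sketched as a small predicate. This is only an illustration, not the scraper's actual code: the function names are hypothetical, and treating identical refs as "not similar" is an assumption based on the separate identical_ref_count column mentioned below.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def is_similar(ref_a: str, ref_b: str) -> bool:
    """Apply the task's definition of "similar" to two reference texts."""
    if ref_a == ref_b:
        return False  # assumption: identical refs are counted separately
    if abs(len(ref_a) - len(ref_b)) >= 10:
        return False  # must differ by less than 10 characters in length
    longer = max(len(ref_a), len(ref_b))
    return levenshtein(ref_a, ref_b) < 0.1 * longer
```

For instance, two refs to the same book that differ only in a two-digit page number would pass this check once the ref text is longer than about 20 characters.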

We cannot commit to doing this regularly; for now it's a one-off.

  • Select a handful of wikis to process, which have varied template-ref usage patterns. Include dewiki.
    • dewiki (12% of pages include similar refs)
    • enwiki (12%)
    • jawiki (16%)
    • cawiki (7%)
  • Go through scraper output and produce a list of all articles with similar refs.
    • Columns: article title, URL to the page, count of similar refs, count of identical refs
    • Example query: gunzip -c < dewiki-20240501-page-summary.ndjson.gz | jq -c "select(.similar_ref_count > 0) | [.title, .ref_count, .similar_ref_count, .identical_ref_count]" > ~/dewiki-20240501-similars.ndjson
  • Choose a limited sample of these pages which we will look at manually.
    • Initial sample of 10 articles per wiki (40 total) was chosen by using shuf on the results of the command above.
  • Hand-code each page from our sample, letting the data define our categories. For example, "two different page numbers in same book", "accidental typo in otherwise identical refs", "accidental whitespace change", "different issues of a magazine or another change which would not fit our extends feature"...
  • Package the output in a form that's easy to share with the German Wikipedia community and publish. Reflect on how "extends" could be applied in various cases, its strengths and weaknesses.
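The filtering and sampling steps above could also be done in one place, roughly as follows. This is a sketch under assumptions: the field names come from the example jq query, the function names are hypothetical, and the page URL is built from the title using the usual /wiki/Title pattern, since the summary records shown above do not include a URL.

```python
import json
import random
from urllib.parse import quote


def pages_with_similar_refs(ndjson_lines, wiki_host):
    """Return one row per page whose similar_ref_count is greater than zero."""
    rows = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        if (page.get("similar_ref_count") or 0) > 0:
            title = page["title"]
            rows.append({
                "title": title,
                # assumption: standard article URL derived from the title
                "url": f"https://{wiki_host}/wiki/{quote(title.replace(' ', '_'))}",
                "similar_ref_count": page["similar_ref_count"],
                "identical_ref_count": page.get("identical_ref_count") or 0,
            })
    return rows


def draw_sample(rows, size=10, seed=None):
    """Pick up to `size` rows at random, similar in spirit to `shuf | head`."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))
```

The first function accepts any iterable of lines, so it can be fed a file opened with gzip.open(path, "rt", encoding="utf-8"). Passing a fixed seed makes the sample reproducible, which shuf on its own does not.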

Summary of findings

Disclaimer: this was a small sampling of N=40 articles taken from 4 wikis, 10 per wiki. We can't draw any definite conclusions, but there was some consistency in the patterns across wikis.

Summary spreadsheet: https://docs.google.com/spreadsheets/d/1uWD6ojZP6uXGlgYtB6TX02NXKN_KpeD4wfYPObY1IpU/edit

Is there evidence for the "book referencing" use case?

  • Yes, in half of the sampled articles there was at least one pair of references which refer to the same book, only differing by a page or volume number.
  • Only one or two sampled articles on each wiki leaned heavily on book references. In most cases these were a small minority of the refs.

Were there any challenges in how "extends" would apply to these book referencing cases?

  • When the book is already mentioned in a Bibliography section, it wouldn't make sense to pull the book out and turn it into a list-defined reference. One possibility is to create a parent reference which points to the bibliography book:

image.png (171×362 px, 12 KB)

  • Nesting some references but not others breaks the visual equivalence between sources, even though whether a source is cited once or twice doesn't deserve such strong semantics.
  • Multi-column layout and very long (e.g. 40) lists of indented subreferences both turn the layout into a narrow vertical strip.
  • In another extreme example, almost all references in the article are from two encyclopedic sources, which could make it harder to see the "parent" context:

image.png (228×799 px, 107 KB)

  • On all wikis other than German Wikipedia, it seems much more common to use templates like {{sfn}}, especially when referencing different pages of a book.

What were the other use cases which were detected as "similar" references?

  • The "different URL" case was roughly twice as prevalent as book references. Here the source could be grouped, but each subreference points to a different Internet resource, so the links would have to go inside each subref. For example, there might be different years of a census or different episodes of a TV show. The title of the entire resource often changes for every subreference:

image.png (95×577 px, 34 KB)

This is an awkward fit for "extends" because the grouped resource might not have a natural name, and the subreferences still require separate metadata such as access date:
image.png (86×485 px, 13 KB)

Another example:
image.png (213×408 px, 37 KB)

  • Sometimes a single author and outlet, e.g. one journalist writing for one newspaper, resembles a single "source" of the kind we want to consolidate, but the articles don't fit well into a single group, since they're usually considered independent units:

image.png (108×404 px, 24 KB)

  • We saw only a handful of "should have been identical" cases, all caused by slightly different ways of typing the same reference. FWIW, these are usually easy to spot if the references are shown in an alphabetically sorted list.

Event Timeline

awight removed awight as the assignee of this task. Tue, Jul 2, 7:16 AM
awight updated the task description.
awight moved this task from Doing to UX/PM Review on the WMDE-TechWish-Sprint-2024-06-26 board.

Question from @WMDE-Fisch which should be investigated:

Would be interesting if these [book referencing] examples use "the" cite book template.

The answer was that dewiki is the exception: in our sample, its book references are written as bare refs. On the other wikis, book referencing is normally done using templates like {{sfn}} (or templates with even more magic included).

Interesting, I was under the impression that dewiki also usually used {{Cite book}} to format references, but you say that's not the case? And for enwiki, I was under the impression that most (if not all) parent references that {{sfn}} templates point to also contain {{Cite book}}, e.g.:

<ref>{{cite book |last=Shulman |first=Seth |title=The Telephone Gambit: Chasing Alexander Bell's Secret |location=New York |publisher=Norton & Company |date=2008 |page=[https://archive.org/details/telephonegambitc00shul/page/49 49] |isbn=978-0-393-06206-9 |url=https://archive.org/details/telephonegambitc00shul/page/49 }}</ref>

and

{{sfn|Shulman|2008|p=46}}

Is that different from your understanding of the situation?

Wrt @WMDE-Fisch's question: could you then not 'count' how many of the similar references contain {{Cite book}} vs journal, news, or other templates? I would love to see the division, and then potentially take a sample of 10 examples each of similar refs that:

  • contain {{Cite book}}
  • contain any other template that is present in at least 10% of similar refs
  • contain no templates whatsoever

That might really help me understand the space more deeply, formulate assumptions on the potential for sub-references to be adopted for each prevalent case, and craft some example materials to validate those assumptions.
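A rough way to get the counts asked about above is sketched here, purely as an illustration. It assumes the wikitext of each similar ref is available as a string, which the per-page summary shown in the task description does not provide. The regular expression is a heuristic that ignores parser functions and may misread triple-brace parameters, names are lowercased as a simplification, and all function names are hypothetical.

```python
import re
from collections import Counter

# Captures the template name between "{{" and the first "|" or "}}".
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}#]+?)\s*(?:\||\}\})")


def template_names(ref_wikitext):
    """Return the set of normalized template names used inside one ref."""
    names = set()
    for raw in TEMPLATE_NAME.findall(ref_wikitext):
        # treat underscores as spaces, collapse whitespace, ignore case
        names.add(" ".join(raw.replace("_", " ").split()).lower())
    return names


def summarize_templates(refs, threshold=0.10):
    """Count refs per template, refs with no template, and frequent templates."""
    counts = Counter()
    no_template = 0
    for ref in refs:
        names = template_names(ref)
        if not names:
            no_template += 1
        counts.update(names)  # each template counted at most once per ref
    total = len(refs)
    frequent = {
        name: n for name, n in counts.items()
        if total and n / total >= threshold
    }
    return counts, no_template, frequent
```

The third return value lists the templates present in at least 10% of the refs, matching the threshold in the second bullet, so each of those groups could then be sampled separately.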