
Use crossref to search for human-readable citations copy-pasted from a bibliography in a PDF
Closed, Resolved · Public

Description

It is very common for me to want to cite a journal article on Wikipedia whose reference I obtain by copy-pasting the human-readable citation from the References section of a PDF, e.g., the text "E. Schrodinger, Proc. Cam. Phil. Soc. 31, 555 (1935)". Pasting this text into Google Scholar returns the correct result with high reliability, and usually it is the only result. However, the otherwise-spectacular Citoid plugin for Wikipedia cannot make this deduction. So currently my workflow is to go through Google Scholar and then grab the URL to feed into Citoid. It would be great if this step could be eliminated.

Event Timeline

Restricted Application added a project: VisualEditor. Oct 26 2017, 10:11 PM
Restricted Application added a subscriber: Aklapper.
Jess_Riedel updated the task description. Oct 26 2017, 10:11 PM

This issue must have been raised before, but I could not find it after significant searching on Phabricator. Apologies if so.

Mvolz added a subscriber: Mvolz. Edited Nov 3 2017, 4:03 PM

If you've got a formatted citation already you can click the "manual" tab, select the "basic" option, and then paste it in. But it will be in plain text, not in a citation template.

T162357 might help make this work in the "automatic" tab. It wouldn't work for anything that isn't in the WorldCat database, but it might perform well enough.

Converting formatted citations into citation templates by parsing them would be a little more complicated.

Mvolz triaged this task as Normal priority. Nov 3 2017, 4:03 PM

Thanks Mvolz. I agree that parsing a formatted citation into a citation template is complicated, but I think there may already be "off the shelf" solutions for this (Zotero?). And obviously, this workflow works and could be automated (with scraping, I guess): (1) paste the formatted citation into Google Scholar; (2) click the "cite" button and choose BibTeX formatting; (3) convert the citation from BibTeX format to wiki format.
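For what it's worth, step (3) on its own is fairly mechanical. Here is a minimal sketch, assuming a simple BibTeX entry with one field per line and no nested braces. The function names and the example entry are hypothetical; the entry contains only the fields from the Schrodinger citation in the task description. This is not what Citoid or Zotero do, just an illustration of the mapping.

```python
import re

# Matches simple one-line fields such as:  volume = {31},
# Nested braces and multi-line values are deliberately not handled.
FIELD_RE = re.compile(r'^\s*(\w+)\s*=\s*[{"](.*?)[}"]\s*,?\s*$', re.MULTILINE)


def parse_bibtex_fields(entry):
    """Return a dict of lower-cased field names to raw string values."""
    return {key.lower(): value for key, value in FIELD_RE.findall(entry)}


def bibtex_to_cite_journal(entry):
    """Render a {{cite journal}} template from a simple BibTeX entry."""
    fields = parse_bibtex_fields(entry)
    params = []

    # BibTeX separates authors with " and " and writes each as "Last, First".
    authors = [a.strip() for a in fields.get("author", "").split(" and ") if a.strip()]
    for i, author in enumerate(authors, start=1):
        last, _, first = author.partition(",")
        params.append((f"last{i}", last.strip()))
        if first.strip():
            params.append((f"first{i}", first.strip()))

    # Copy over the fields that have the same name in both formats.
    for name in ("journal", "volume", "pages", "year"):
        if fields.get(name):
            params.append((name, fields[name]))

    return "{{cite journal|" + "|".join(f"{k}={v}" for k, v in params) + "}}"


EXAMPLE = """@article{schrodinger1935,
  author = {Schrodinger, E.},
  journal = {Proc. Cam. Phil. Soc.},
  volume = {31},
  pages = {555},
  year = {1935}
}"""

print(bibtex_to_cite_journal(EXAMPLE))
```

The hard part remains steps (1) and (2), i.e. getting from the raw text to structured fields in the first place.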

Mvolz moved this task from Backlog to Service on the Citoid board. Nov 9 2017, 10:41 AM
Mvolz added a comment. Nov 9 2017, 10:49 AM

Zotero doesn't do this, but their site lists some off the shelf solutions here: https://www.zotero.org/support/kb/importing_formatted_bibliographies

This one looks the most promising: http://www.molspaces.com/d_cb2bib-overview.php. But none of them look like something we'd be likely to put into production.

The best case scenario would be a node.js library that would do this. I had a look around and didn't see any. The best I found is https://github.com/larsgw/citation.js#cite.in.type but it looks like it doesn't support pre-formatted citations, only converts TO them.

At Wikicite we have also had a demo of Bilbo, which attempts to parse plain text citations to extract metadata:
https://github.com/OpenEdition/bilbo
I am in touch with the authors of this tool and they were interested in adapting it to wikitext (but no concrete plans yet).

Hmm. Took me a while to figure out how to work the online demonstration. The input needs to be formatted as XML like the examples. Unfortunately, when I inserted some references from a paper I am currently reading (like "R. Delbourgo and J. R. Fox, J. Phys. A 10, L233 (1977)."), the demo mistakes the journal "J. Phys. A" (i.e., "Journal of Physics A") for a person's name. In contrast, copy-pasting this into Google Scholar returns the correct result.

Based on the description of the way the code works, I think the major issue is that it tries to interpret a given reference mostly by looking at the raw text with Conditional Random Fields, rather than by matching it against a known database of correct references.
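To illustrate what I mean by matching against a database, here is a toy sketch. It is not how Bilbo, Google Scholar, or Citoid work; the function names and the token-overlap (Jaccard) score are my own assumptions, and the two "known" records are just the citations mentioned in this task. The point is that "J. Phys. A" never has to be classified as a journal or a person, because the lookup only compares tokens against records that are already structured.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# A tiny stand-in for a real bibliographic database.
KNOWN_RECORDS = [
    {"authors": "R. Delbourgo, J. R. Fox", "journal": "J. Phys. A",
     "volume": "10", "page": "L233", "year": "1977"},
    {"authors": "G. Vidal", "journal": "Phys. Rev. Lett.",
     "volume": "101", "page": "110501", "year": "2008"},
]


def best_match(raw_citation, records):
    """Return the record whose tokens overlap most with the raw citation."""
    raw_tokens = tokens(raw_citation)

    def score(record):
        record_tokens = tokens(" ".join(record.values()))
        # Jaccard similarity: shared tokens divided by all tokens.
        return len(raw_tokens & record_tokens) / len(raw_tokens | record_tokens)

    return max(records, key=score)


match = best_match("R. Delbourgo and J. R. Fox, J. Phys. A 10, L233 (1977).", KNOWN_RECORDS)
print(match["journal"], match["volume"], match["page"])
```

A real service would of course need a proper index and a better ranking function, but the structured fields come for free once the right record is found.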

Still, this is going in the correct direction.

Mvolz claimed this task. Apr 23 2018, 11:48 AM
Mvolz renamed this task from Citoid plugin for Wikipedia does not identify raw human-readable citations copy-pasted from a bibliography in a PDF to Use crossref to search for human-readable citations copy-pasted from a bibliography in a PDF. Apr 23 2018, 11:58 AM

Change 429172 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] Add open search capability with crossref

https://gerrit.wikimedia.org/r/429172

Change 429172 merged by jenkins-bot:
[mediawiki/services/citoid@master] Add open search capability with crossref

https://gerrit.wikimedia.org/r/429172
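For readers who want to try this kind of lookup directly, here is a minimal sketch of a free-text search against the public Crossref REST API. It is not the code from the patch above. The helper names are hypothetical, and the choice of the `query.bibliographic` and `rows` parameters is my assumption about a reasonable way to query the `/works` endpoint, not a statement of what Citoid sends.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_search_url(citation, rows=1):
    """Build a Crossref /works URL that searches on the raw citation text."""
    query = urlencode({"query.bibliographic": citation, "rows": rows})
    return CROSSREF_WORKS + "?" + query


def first_hit(response):
    """Return the top-ranked item from a decoded Crossref response, or None."""
    items = response.get("message", {}).get("items", [])
    return items[0] if items else None


def search_crossref(citation):
    """Fetch the best match for a pasted citation (requires network access)."""
    with urlopen(build_search_url(citation), timeout=10) as resp:
        return first_hit(json.load(resp))


print(build_search_url("E. Schrodinger, Proc. Cam. Phil. Soc. 31, 555 (1935)"))
```

Calling `search_crossref(...)` with the pasted citation returns the top-ranked record, if any; whether that record is the right one still depends on Crossref's ranking.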

Mentioned in SAL (#wikimedia-operations) [2018-05-17T10:09:52Z] <mobrovac@tin> Started deploy [citoid/deploy@8a26508]: Update citoid to 2f35126 - T179123 T185217

Mentioned in SAL (#wikimedia-operations) [2018-05-17T10:12:44Z] <mobrovac@tin> Finished deploy [citoid/deploy@8a26508]: Update citoid to 2f35126 - T179123 T185217 (duration: 02m 52s)

Mvolz closed this task as Resolved. May 17 2018, 10:28 AM
Mvolz removed a project: Patch-For-Review.
Mvolz removed subscribers: Stashbot, gerritbot.
Restricted Application added a project: User-Ryasmeen. May 17 2018, 10:28 AM

This is now done, but please note it only works okay with journal articles and book chapters in the crossref repository... it will give nonsense if you're looking for a book. Books will be handled in T162357.

Wow! I'm very impressed. Minor issue: The page number doesn't seem to get filled in correctly, and for many journals (like Physical Review Letters), page number is crucial for identifying the article. For instance, I pasted

G. Vidal, “Class of quantum many-body states that can be efficiently simulated,” Phys. Rev. Lett. 101, 110501 (2008).

into Citoid and got

<ref>{{cite journal|first1=G.|last1=Vidal|title=Class of Quantum Many-Body States That Can Be Efficiently Simulated|url=http://dx.doi.org/10.1103/physrevlett.101.110501|journal=Physical Review Letters|date=12 September 2008|issn=0031-9007,1079-7114|volume=101|issue=11|doi=10.1103/physrevlett.101.110501}}</ref>

which lacks a page number. (Also, incidentally, the "issn" entry causes an error.)

(Should I open a new ticket for this?)

Mvolz added a comment. Jul 1 2018, 1:56 PM

Seems to work for me now?

Unfortunately, the issue still exists for me. Today, I again pasted the above citation by G. Vidal into the "URL" bar on Citoid, and I got the above reference under the output "Template code". For completeness, here is the "Full citoid data":

[
  {
      "itemType": "journalArticle",
      "issue": "11",
      "DOI": "10.1103/physrevlett.101.110501",
      "title": "Class of Quantum Many-Body States That Can Be Efficiently Simulated",
      "volume": "101",
      "publicationTitle": "Physical Review Letters",
      "date": "2008-09-12",
      "url": "http://dx.doi.org/10.1103/physrevlett.101.110501",
      "ISSN": [
          "0031-9007",
          "1079-7114"
      ],
      "accessDate": "2018-07-04",
      "author": [
          [
              "G.",
              "Vidal"
          ]
      ],
      "source": [
          "Crossref"
      ]
  }
]

Notice that there is no page number listed. In this case, the page number is 110501, which only appears in the URL and the DOI. (It is crucial that this number is added as a "page" in the citation or else it will not generate correctly when appearing in a Wikipedia article.) I performed this Citoid search on macOS 10.13.5 using both Chrome 67.0.3396.87 and Safari 11.1.1 (13605.2.8) and got the same thing.
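To make the two problems concrete, here is a rough sketch of how a record like the one above maps onto a cite journal template. This is not Citoid's or VisualEditor's actual rendering code. The function name is hypothetical, and I am assuming that a page number, when present, would arrive under a "pages" key. The sketch emits a pages parameter only if that key exists, and uses just the first ISSN rather than the comma-joined pair that triggers the template error.

```python
def citoid_to_cite_journal(item):
    """Render a {{cite journal}} template from one Citoid result dict."""
    params = []

    # Authors arrive as [first, last] pairs.
    for i, (first, last) in enumerate(item.get("author", []), start=1):
        params.append((f"first{i}", first))
        params.append((f"last{i}", last))

    # (template parameter, Citoid key); "pages" is an assumed key name.
    field_map = [
        ("title", "title"), ("url", "url"), ("journal", "publicationTitle"),
        ("date", "date"), ("volume", "volume"), ("issue", "issue"),
        ("pages", "pages"), ("doi", "DOI"),
    ]
    for wiki_key, citoid_key in field_map:
        value = item.get(citoid_key)
        if value:
            params.append((wiki_key, value))

    # The template expects a single ISSN, so keep only the first one.
    issns = item.get("ISSN") or []
    if issns:
        params.append(("issn", issns[0]))

    return "{{cite journal|" + "|".join(f"{k}={v}" for k, v in params) + "}}"


RECORD = {
    "itemType": "journalArticle",
    "issue": "11",
    "DOI": "10.1103/physrevlett.101.110501",
    "title": "Class of Quantum Many-Body States That Can Be Efficiently Simulated",
    "volume": "101",
    "publicationTitle": "Physical Review Letters",
    "date": "2008-09-12",
    "url": "http://dx.doi.org/10.1103/physrevlett.101.110501",
    "ISSN": ["0031-9007", "1079-7114"],
    "author": [["G.", "Vidal"]],
}

print(citoid_to_cite_journal(RECORD))
```

With the record as returned, the output has no pages parameter at all, which matches what I see; the number has to be present in the Citoid data before any template can show it.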

Incidentally, now that Citoid has this powerful feature, it may be better to rename the search bar from "URL" to something more general.

Thanks again for building this highly useful tool.