Filter out typo'd DOIs from the dataset to improve the accuracy of mwcites parsing.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Open | None | | T111066 Create recommendations for databases/journals/websites, by WikiProject for WikiProject X
Open | None | | T99046 Retrieve DOI metadata and identify non-resolving DOIs.
Event Timeline
@DarTar, post your updates you talked about here: https://twitter.com/ReaderMeter/status/649597148862922753
hey guys,
per this tweet, I think this is what needs to happen next:
1. We need to check the quality of the DOIs extracted from the dumps (this is what the current task is about) and flag non-resolving DOIs. It might take a few days to make individual calls to the /works endpoint of the Crossref API for all 700K items in our list if we want to test their validity and cache the metadata; I'm wondering whether it's more effective to ask @Afandian to generate a clean dataset from this snapshot on his end.
2. We need to decide what metadata we want to cache and import for an initial proof of concept. I pasted below the typical response of /works for a complete item (with a fetching sketch after it). @Daniel_Mietchen, @Harej: your input would be great here.
3. Based on 2), we need to make sure the data model is complete and all the corresponding properties are implemented in LibraryBase.
4. Finally, once 3) is done, we need to determine the best format to bulk ingest data into LibraryBase and how to deal with duplicate items, for example papers by the same author or published in the same journal. Should we ingest all the data first and merge duplicate items later, or perform deduplication at the time of the import? Consider that the latter may generate false positives for homonyms. @Magnus, your advice here would be great.
Does this make sense or am I missing anything? Our goal should be to
- sandbox the data model
- seed it with real data
- refine the corpus once it's imported
- pitch it to the community to figure out any data quality issues before we discuss any plans to import it into Wikidata.
CrossRef API response format: /works
{ "status": "ok", "message-type": "work", "message-version": "1.0.0", "message": { "indexed": { "date-parts": [ [ 2015, 9, 28 ] ], "timestamp": 1443414872011 }, "reference-count": 6, "publisher": "CrossRef Test Account", "issue": "11", "license": [ { "content-version": "tdm", "delay-in-days": 1195, "start": { "date-parts": [ [ 2011, 11, 21 ] ], "timestamp": 1321833600000 }, "URL": "http://psychoceramicsproprietrylicenseV1.com" } ], "funder": [ { "award": [ "DE-SC0001091" ], "name": "Basic Energy Sciences", "DOI": "10.13039/100006151" }, { "award": [ "CHE-1152342" ], "name": "National Science Foundation", "DOI": "10.13039/100000001" } ], "DOI": "10.5555/12345678", "type": "journal-article", "page": "1-3", "update-policy": "http://dx.doi.org/10.5555/crossmark_policy", "source": "CrossRef", "title": [ "Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory" ], "prefix": "http://id.crossref.org/prefix/10.5555", "volume": "5", "author": [ { "affiliation": [], "family": "Carberry", "given": "Josiah", "ORCID": "http://orcid.org/0000-0002-1825-0097" } ], "member": "http://id.crossref.org/member/7822", "container-title": [ "Journal of Psychoceramics" ], "deposited": { "date-parts": [ [ 2015, 9, 17 ] ], "timestamp": 1442448000000 }, "score": 1, "subtitle": [], "issued": { "date-parts": [ [ 2008, 8, 13 ] ] }, "URL": "http://dx.doi.org/10.5555/12345678", "ISSN": [ "0264-3561" ], "assertion": [ { "group": { "label": "Identifiers", "name": "identifiers" }, "label": "ORCID", "name": "orcid", "order": 0, "URL": "http://orcid.org/0000-0002-1825-0097", "value": "http://orcid.org/0000-0002-1825-0097" }, { "group": { "label": "Publication History", "name": "publication_history" }, "label": "Received", "name": "received", "order": 0, "value": "2012-07-24" }, { "group": { "label": "Publication History", "name": "publication_history" }, "label": "Accepted", "name": "accepted", "order": 1, "value": "2012-08-29" }, { "group": { "label": "Publication History", "name": "publication_history" }, "label": "Published", "name": "published", "order": 2, "value": "2012-09-10" } ] } }
@Halfak, see my notes above after our quick discussion. This goes beyond the scope of the data cleanup, but if we agree on the plan we can create separate tickets.
A tag for Librarybase would be useful.
The big thing missing from this is what Wikipedia articles the citation appears on. Librarybase is one of the outputs (so far) of T111066, a task to create a reference recommender system based on what sources are used in Wikipedia articles. A matchup of sources to articles would be useful for this purpose. Is this data available during the DOI mining process, or are the DOIs coming from another source?
In terms of test cases, I suggest using the same sets of articles as in
https://www.wikidata.org/wiki/Wikidata:WikiProject_Source_MetaData/Source,_M.D./Tests#Test_case:_Malaria .
First, you can check for the existence of a DOI just by resolving the URL and seeing what response you get back. That's quicker than the API, and will serve as a first pass for finding valid DOIs.
There's no problem calling the API for each DOI. I think at these volumes the balance between creating a dump of 80 million DOIs and keeping that up to date (the dataset changes all the time) and calling the API is in favour of calling the API.
We have the dataset we've been collecting with Cocytus to get things started. Maybe that can help? http://events.labs.crossref.org/events/types/WikipediaCitation/live
Also, I'm working on some code to skim through a data dump and extract resolving DOIs from text (I started it for the recent Reddit dump but it can be easily extended). It uses Apache Spark for high-speed distributed work. I would be happy to run it over a corpus if that helps.
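For illustration only, here's a much-simplified sketch of the kind of pattern-based extraction involved; this is not the Spark code mentioned above, and mwcites has its own, more careful extractor.

```
# Illustration only: a naive DOI extractor. Real DOI suffixes can contain
# almost any printable character, so a pattern like this also picks up
# trailing punctuation and publisher suffixes, which is exactly the cleanup
# problem this task is about.
import re

DOI_RE = re.compile(r'\b(10\.\d{4,9}/[^\s"<>]+)')

def extract_dois(text):
    return DOI_RE.findall(text)

sample = "See doi:10.1126/science.169.3946.635 and http://dx.doi.org/10.5555/12345678."
print(extract_dois(sample))
# ['10.1126/science.169.3946.635', '10.5555/12345678.']  <- note the trailing "."
```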
We already have a high-speed text extractor for MediaWiki XML dumps and a dataset of DOIs. I'd like to start there with the metadata gathering. This task is only about gathering relevant metadata for the DOIs in that dataset. A second pass can be made over that metadata to explore opportunities to import it into librarybase or whatever. Producing a dataset that facilitates such a second pass should be our goal.
While it will take a long time to build a cache of metadata for DOIs, I don't see keeping it up to date as a problem.
My primary concern right now is that we can't just make calls to the /works endpoint, because we just have the DOIs and we don't know the agency. It seems like the Crossref API will only return metadata for Crossref and public DOIs. For example, we can't get metadata for our published dataset of scholarly identifiers via its DOI (http://api.crossref.org/works/10.6084/m9.figshare.1299540). The agency for that DOI is 'datacite'. So, is there a general API endpoint that we can use for any DOI, or do we need to be more clever about hitting multiple APIs?
Yes. This is what Content Negotiation is for. The RAs (Registration Agencies) work together to expose this API via DOI. We also have various data formats that we all support. Information here: http://www.crosscite.org/cn/ . Long story short:
curl -LH "Accept: application/rdf+xml" http://dx.doi.org/10.1126/science.169.3946.635
curl -LH "Accept: application/rdf+xml" http://dx.doi.org/10.6084/m9.figshare.1299540
Also, I should mention that there is a special case in the Crossref API for looking up the RA, and it works for DOIs from any RA: http://api.crossref.org/works/10.6084/m9.figshare.1299540/agency
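To make that concrete, here's a rough sketch of how the agency lookup could be combined with content negotiation; the exact JSON shape of the /agency response assumed below is from memory, so treat it as an assumption.

```
# Sketch: ask the Crossref /agency endpoint which RA issued a DOI, then use
# content negotiation on dx.doi.org (which works across RAs) for metadata.
# The parsing assumes the agency id lives at message.agency.id.
import requests

def doi_agency(doi):
    r = requests.get("http://api.crossref.org/works/{}/agency".format(doi), timeout=30)
    r.raise_for_status()
    return r.json()["message"]["agency"]["id"]  # e.g. "crossref" or "datacite"

def doi_metadata_rdf(doi):
    # Equivalent to: curl -LH "Accept: application/rdf+xml" http://dx.doi.org/<doi>
    r = requests.get("http://dx.doi.org/" + doi,
                     headers={"Accept": "application/rdf+xml"},
                     allow_redirects=True, timeout=30)
    return r.text if r.status_code == 200 else None

print(doi_agency("10.6084/m9.figshare.1299540"))  # expected: "datacite"
```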
If you want to cache the *existence* of DOIs, that's a great idea, and would be trivial to implement. I would suggest you operate it as a cache though, and fill it request by request rather than front-loading it. If you really want to front-load it, we can explore that.
Checking the existence of a DOI is simple: just try to resolve it:
curl -I http://dx.doi.org/10.5555/12345678
The 303 means it was found. No need to follow the redirect.
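In Python, the same check might look like this (a minimal sketch using the `requests` library, mirroring the curl command above):

```
# Minimal existence check: a 303 from dx.doi.org means the DOI resolved;
# anything else is suspect. No need to follow the redirect.
import requests

def doi_resolves(doi):
    response = requests.head("http://dx.doi.org/" + doi,
                             allow_redirects=False, timeout=30)
    return response.status_code == 303

for doi in ["10.5555/12345678", "10.5555/not-a-real-doi"]:
    print(doi, doi_resolves(doi))
```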
> This task is only about gathering relevant metadata for the DOIs in that dataset. A second pass can be made over that metadata to explore opportunities to import it into librarybase or whatever. Producing a dataset that facilitates such a second pass should be our goal.
+1 on the scope of the current task, if we agree on the overall plan I'll move the next steps to separate phab tickets.
> There's no problem calling the API for each DOI. I think at these volumes the balance between creating a dump of 80 million DOIs and keeping that up to date (the dataset changes all the time) and calling the API is in favour of calling the API.
For now we just need a snapshot of data to (a) improve the accuracy of mwcites (which includes historical data) and (b) prepare a sample dataset to import into LibraryBase. I uploaded a sorted list of unique DOIs extracted from http://dx.doi.org/10.6084/m9.figshare.1299540. It should be available as soon as the directory is rsynced at http://datasets.wikimedia.org/public-datasets/enwiki/mwcites/dois.txt.gz. I agree that for future real-time imports (DET -> LB or Citoid -> LB) we'll need to rely on APIs; for test purposes we can use whatever cheap and fast solution we have. I agree Cocytus is another good starting point for modeling the LB schema.
@Halfak I attached the response for each unique DOI in the February 2015 dataset. 20,148 out of 738,480 (2.7%) are non-resolving.
@DarTar, are those actually not part of the DOI? Is it our fault that we don't strip them off or the resolver's fault for not understanding the suffix?
I'd be happy to strip these off if there was a way to consistently detect them. We could split the DOI on "/" and use a lookup table, but that's a bit hacky.
It looks like those suffix'd links are not to https://dx.doi.org -- they are usually to https://onlinelibrary.wiley.com. For example: http://onlinelibrary.wiley.com/doi/10.1046/j.1095-8339.2003.t01-1-00158.x/pdf
That website probably supports a limited list of suffixes like this. That's making me think the lookup table isn't such a bad idea after all.
@Halfak sounds about right. We should replace all occurrences of DOI links to publisher-specific resolvers (which may use non-canonical conventions) with CrossRef's resolver.
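Something like the lookup-table idea could be as simple as the sketch below; the suffix list here is hypothetical and would need to be compiled from the actual non-resolving DOIs in the dataset.

```
# Hypothetical suffix lookup table along the lines discussed above; the real
# list would come from inspecting the non-resolving DOIs (e.g. Wiley's "/pdf").
KNOWN_SUFFIXES = ("/pdf", "/full", "/abstract", "/epdf")

def strip_publisher_suffix(doi):
    for suffix in KNOWN_SUFFIXES:
        if doi.lower().endswith(suffix):
            return doi[:-len(suffix)]
    return doi

print(strip_publisher_suffix("10.1046/j.1095-8339.2003.t01-1-00158.x/pdf"))
# -> 10.1046/j.1095-8339.2003.t01-1-00158.x
```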
Another type of failure I see looks like this: 10.1086/591526+10.1088/0004-637X/706/1/L203
I'm not sure how we'd be able to tell that a "+" is not part of the DOI.
When I searched for this exact string, I found this listing: http://arxiv.org/abs/0805.4758 It seems that both DOIs are associated with the same paper: one is for the paper itself and the other is an erratum for it!
I'm thinking that we might get high fitness by having a special rule in the parser for *splitting characters* like "+" and "&". If we see them right before some whitespace or a new DOI_START, then stop reading the DOI.
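A rough sketch of that splitting rule (not the actual mwcites parser change, just the idea):

```
# Cut a candidate DOI at "+" or "&" when what follows is whitespace, the end
# of the string, or the start of another DOI ("10.<registrant>/").
import re

SPLIT_RE = re.compile(r'[+&](?=\s|$|10\.\d{4,9}/)')

def split_candidate(candidate):
    return [part for part in SPLIT_RE.split(candidate) if part.strip()]

print(split_candidate("10.1086/591526+10.1088/0004-637X/706/1/L203"))
# -> ['10.1086/591526', '10.1088/0004-637X/706/1/L203']
```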
Joe from Crossref chiming in here.
Having been in this position myself, the best way to do this is to build some heuristics based on the data you get (like "split whenever you see a new '10\.\d+'" or "exclude '/fulltext'") and just try to resolve the candidate DOIs until something matches. One approach I took recently was to drop the last character off a suspected DOI until it resolves. A HEAD request is cheap, and at volumes of e.g. 100,000 it's not really a big issue.
NB: DOIs can have slashes in them.
It's a shame we don't have a full grammar for DOIs, but we are where we are.
Joe
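A sketch of that trim-until-it-resolves heuristic, built on the same HEAD/303 check shown earlier in the thread (the candidate DOI and minimum length below are made up for illustration):

```
# Drop the last character off a suspected DOI until it resolves.
import requests

def doi_resolves(doi):
    head = requests.head("http://dx.doi.org/" + doi, allow_redirects=False, timeout=30)
    return head.status_code == 303

def trim_until_resolves(candidate, min_length=len("10.1/x")):
    while len(candidate) >= min_length:
        if doi_resolves(candidate):
            return candidate
        candidate = candidate[:-1]  # drop the last character and try again
    return None  # nothing resolvable found

print(trim_until_resolves("10.5555/12345678spurious"))
```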
I think you're right: "+" is a valid character, but not if followed by whitespace or a DOI prefix.
I also noticed plenty of non-resolving DOIs with spurious URL parameters like:
10.1016/j.jesp.2003.11.001&title=Journal+of+experimental+social+psychology&volume=40&issue=5&date=2004&spage=586&issn=0022-1031
Note that these issues probably only apply to historical DOIs; if Citoid becomes the norm for adding scholarly citations going forward, we should not have many of these problems in the future.
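For the OpenURL-style junk above, a simple sketch is to strip everything from the first "&key=" onwards; a real implementation would still want to confirm the trimmed string resolves, since "&" can legitimately appear in a DOI.

```
# Strip OpenURL-style parameters of the kind shown above: everything from the
# first "&key=" onwards is very unlikely to be part of the DOI itself.
import re

OPENURL_JUNK_RE = re.compile(r'&\w+=.*$')

def strip_openurl_params(doi):
    return OPENURL_JUNK_RE.sub("", doi)

print(strip_openurl_params(
    "10.1016/j.jesp.2003.11.001&title=Journal+of+experimental+social+psychology"
    "&volume=40&issue=5&date=2004&spage=586&issn=0022-1031"))
# -> 10.1016/j.jesp.2003.11.001
```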
I've filed these issues as bugs and will see if I can get them cleaned up for a new test run. Dario, can you share a cache of the DOI/metadata pairs and the code you used to extract the metadata?
Bugs:
Copying what I told @Halfak over voice: I tested resolution by using @Afandian's suggestion with a few lines of code. I didn't retrieve the full metadata, which can be obtained via the /works endpoint of the CrossRef API. That won't work for DOIs issued by other RAs, though, per T99046#1696974.