wrong author listed and wrong first/last name for the one author listed using ISBN lookup
Closed, Resolved · Public

Description

citoid has wrong author listed and the wrong first/last name for the one author listed.

http://web.archive.org/web/20170318230352/https://www.worldcat.org/title/black-artists-of-the-new-generation/oclc/886799569

vs.

https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/0396074340

WorldCat's first listed author doesn't appear in the citoid output at all. The second listed author in this case happens to be the foreword author, not an author of the work proper.

(via https://en.wikipedia.org/wiki/User:Versary19/sandbox )

"It is pulling dates into last name, putting the publisher as "other" and I don't know what else. This was reported by the Amon Carter Museum -- at first I thought it was just a weird record because you can get those but I had the same results with others. Also missing publication date."

Error occurs with:
9780995555563
0511040938
9780838916322

Event Timeline

Restricted Application added a subscriber: Aklapper.

An update about this issue.
Using the newly released ISBN citation feature, we discovered that the first name/last name problem is still around.
How to reproduce:

  • insert the ISBN "9780123850591" (Computer networks by Larry L Peterson and Bruce S Davie)
  • the template is getting "S." as last name and "Davie, Bruce" as first name

https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/9780123850591
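
For anyone who wants to repeat this check from a script rather than the editor, here is a minimal sketch. It only reuses the URL pattern quoted above. The helper names (`citoid_url`, `fetch_authors`), the User-Agent string, and the assumption that the response is a JSON list of citation objects with an optional `author` key are illustrative and not confirmed in this task.

```python
import json
import urllib.parse
import urllib.request

# URL prefix copied from the links in this task.
CITOID_BASE = "https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/"


def citoid_url(identifier: str) -> str:
    """Build the lookup URL for an ISBN using the pattern quoted in this task."""
    return CITOID_BASE + urllib.parse.quote(identifier.strip(), safe="")


def fetch_authors(identifier: str):
    """Fetch the citation(s) and return the raw 'author' value of each one.

    Assumption: the response body is a JSON list of citation objects.
    Requires network access, so it is not exercised automatically here.
    """
    request = urllib.request.Request(
        citoid_url(identifier),
        headers={"User-Agent": "citoid-repro-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        citations = json.load(response)
    return [citation.get("author") for citation in citations]
```

Calling `fetch_authors("9780123850591")` should show whatever first/last split the service currently returns for this book.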

Exporting the citation from worldcat https://www.worldcat.org/title/computer-networks-a-systems-approach/oclc/781227361 generates correct data, the authors in the download file are:

  • Peterson, Larry L.
  • Davie, Bruce S.

@Mvolz this isn't even triaged, and yet it seems to affect every wiki, making ISBN ref generation a bit lame?

I looked into this extensively when I did T155161, hoping that adding another data format would fix the problem.

Unfortunately the data is very, very inconsistent, and actually worse in some ways in the MarcXML than the DublinCore format. They simply don't give us the data in a consistent and structured way from the API. From record to record it's often not even in the same field.

I agree the data looks okay from the worldcat website. I wonder if they have an internal data format that we don't have access to. Unfortunately we're only able to access their data in MarcXML and DublinCore and both of those formats have some significant flaws.

Maybe with some more sophisticated natural language processing we could do a little better, but probably not 100% (particularly concerning the foreword author issue) :/

I can look into it again at some point.

Mvolz triaged this task as Medium priority. Nov 28 2017, 3:52 PM
Mvolz removed a project: Internet-Archive.

Thanks for the detailed reply. We do have contacts there, though? Can we maybe ask them what's going on? :)

Hi, this seems to be working differently than when launched. For example the ISBN 9788611177434 which is used in the WMF blog post https://blog.wikimedia.org/2017/05/11/wikimedia-oclc-partnership/ no longer works and in fact throws up an error. Have you worked with Karen Combs at OCLC? If not I can put you in touch.

Hi, that ISBN works for me at en.wp. Error messages can happen - you can fix them before or after saving. https://en.wikipedia.org/wiki/Help:CS1_errors#bad_date is the guide to fix the one that this specific source is giving me.

When I go back to edit the resulting citation it seems to have saved it as a text block and not as a book citation with the various fields?

<s>Possibly an issue with that specific ISBN? A different one generated https://en.wikipedia.org/w/index.php?title=User%3AElitre_%28WMF%29%2Fsandbox&type=revision&diff=839628277&oldid=839601708 for me. </s>

Thankfully I have smarter colleagues who figure stuff out for me.

@Merrilee, that ISBN is for an audio recording, which (according to https://en.wikipedia.org/wiki/MediaWiki:Citoid-template-type-map.json ) is supposed to use the {{citation}} template rather than the {{cite book}} template. (It's strange that the blog post mentions this ISBN in text, but shows a different one in the image.)

The date error is generated by the local CS1 template, because "cop. 2007" is not a date format that it recognizes.

Mvolz updated the task description.

Change 497315 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] Don't split authors by default

https://gerrit.wikimedia.org/r/497315

Change 497315 merged by jenkins-bot:
[mediawiki/services/citoid@master] Don't split authors by default

https://gerrit.wikimedia.org/r/497315

I've now deployed a fix that no longer splits the authors, and just puts them all in the "last" field. So in terms of how it's rendered with most citation templates, it looks a lot better. However this is a bit hacky, so I'm leaving this open for right now. It will continue to leave in parts of the author name that we don't necessarily want, like the birth date in parens.

Unfortunately, this entirely broke all author data for ISBN look-ups for RefToolbar, which excludes any author data that has a comma in it (under the assumption that this indicates the data is malformed). It took me a long time to track down where this regression came from. In the future, could you please post a notice at https://en.wikipedia.org/wiki/Wikipedia_talk:RefToolbar any time you make a significant change to what Citoid returns? Thanks!

FWIW, https://gerrit.wikimedia.org/r/497315 doesn't seem like a great solution. You end up getting weird punctuation in the citations, like:

Shird, Kevin,. The colored waiting room : empowering the original and the new civil rights movements. Malden, Nelson,. New York. ISBN 978-1-948062-01-5. OCLC 1029842051.

King, Coretta Scott, 1927-2006,. My life, my love, my legacy. Reynolds, Barbara A., (First edition ed.). New York. ISBN 978-1-62779-598-2. OCLC 950430557.

Is it time to switch over to Zotero for all ISBN look-ups, maybe (T214802)? Is there any downside to just switching over to Zotero entirely? The output from Worldcat seems pretty abysmal.

kaldari renamed this task from "wrong author listed and wrong first/last name for the one author listed" to "wrong author listed and wrong first/last name for the one author listed using ISBN lookup". Aug 3 2020, 7:42 PM

Sorry - didn't occur to me! @Whatamidoing-WMF - do we have any sort of process on communicating en wiki stuff about citoid documented somewhere? Is it mostly just the VE boards?

Yeah T214802 is probably worth looking into.

do we have any sort of process on communicating en wiki stuff about citoid

Yes. The process is:

  1. Significant technical changes should be announced in Tech News – just go to https://meta.wikimedia.org/wiki/Tech/News/Next, and add a sentence or two with a link to the Phab task (Please do this now! It's not too late to be useful!), and
  2. People who use other people's services or tools (whether WMF or otherwise) are required to keep track of changes to their dependencies themselves. The English Wikipedia is not the only place that uses citoid, and RefToolbar is not the only enwiki tool to use it. People can start using citoid without giving us any notice. It is generally bad to privilege a single tool at a single, well-connected community.

On the other question, about switching "to Zotero", Zotero appears to use the Library of Congress (~39 million catalogued books) first, and WorldCat (~450 million records) when that fails. I didn't happen to grow up with French spacing, and I did grow up with title case for book titles, so I've often wanted to "correct" a few things in the WorldCat database, but I'm not sure that a worldwide service would find all of the "weird" punctuation as weird as I do, and they might find some of the Library of Congress's American-style punctuation and capitalization to be wrong. In the end, whatever source we use, some people won't like it.

@Whatamidoing-WMF - It looks like the Library of Congress is using the same "weird" punctuation and capitalization as WorldCat:

Maybe it's some kind of library database standard or maybe the LOC is reusing WorldCat data for their own books.
@Mvolz - Can you report the titles that you get for the following ISBN numbers via Zotero's API: 9781608468553, 9780553803549, 9780912647180. Just want to confirm that they are similar to those returned by the LOC search.

Yup, same titles!

@kaldari re this change in general, how would you feel if I reverted it and just improved the results using the code you use on refToolBar?

This change was pretty hacky and not really best practice so I think maybe undoing it is the best option.

@Mvolz - I would be fine with that, but there are a couple caveats:

  • The refToolbar code isn't foolproof, and in fact, there is no foolproof way to handle the WorldCat author data since its formatting is just too inconsistent.
  • It currently handles "Jr." and "Sr.", but not suffixes or prefixes that may exist in other languages.

If you want to use it, the most recent version of the code is at https://github.com/alexz-enwp/reftoolbar/blob/master/lookup.php#L139.
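
To make the caveats concrete, here is a rough sketch of this kind of comma-based splitting. It is not a port of the linked lookup.php; the function name `split_author`, the date-range pattern, and the choice to append "Jr."/"Sr." to the first name are assumptions made for illustration.

```python
import re

# Only these two suffixes are recognised, mirroring the caveat above.
SUFFIXES = {"jr", "sr"}

# Matches life-date fragments such as "1927-2006", "1950-" or "(1775-1817)".
DATE_RANGE = re.compile(r"^\(?\d{4}\s*-\s*(?:\d{4})?\)?\.?$")


def split_author(raw: str):
    """Split a catalogue-style 'Last, First, dates' string into (first, last)."""
    pieces = [piece.strip() for piece in raw.split(",")]
    # Drop empty fragments (trailing commas) and life-date fragments.
    pieces = [p for p in pieces if p and not DATE_RANGE.match(p)]
    if not pieces:
        return ("", "")
    last = pieces[0]
    rest = pieces[1:]
    given = [p for p in rest if p.rstrip(".").lower() not in SUFFIXES]
    suffix = [p for p in rest if p.rstrip(".").lower() in SUFFIXES]
    # With no comma at all, everything stays in the last-name slot.
    return (" ".join(given + suffix), last)


print(split_author("Davie, Bruce S."))                   # ('Bruce S.', 'Davie')
print(split_author("King, Coretta Scott, 1927-2006,"))   # ('Coretta Scott', 'King')
```

Anything that is not a personal name but contains a comma (a corporate author, for example) is still split as if it were "Last, First", which is one reason an approach like this cannot be foolproof.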

In the meantime, I'll see if we can ask our contact at WorldCat if there are any better options.

It looks like they've released a v2 of their search api since I last looked that returns JSON: https://developer.api.oclc.org/wcv2 (instead of MarcXML or Dublin core, which is what we use now).

I think switching to that is a good idea, although even just using their sample input there's still a bit of junk left over in the authors field (a stray period!) But probably better than doing that processing ourselves!

Note this doesn't really solve the whole issue because it's only available from their open search api. The metadata api (which we use for requests for isbns) still only returns MarcXML and DublinCore. Maybe we could nudge them into offering JSON for their https://developer.api.oclc.org/wc-metadata too?

Change 632732 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] [WIP] Divide authors by first name and last name

https://gerrit.wikimedia.org/r/632732

@kaldari
I've implemented the reftoolbar solution in this change.
Note this unfixes T203361 and T218125.

Also we get some bad results still. For example, DK Publishing, Inc. becomes FirstName: Inc, Last Name: DK Publishing, and in a formatted citation on en wiki the author will be Inc DK Publishing. Yes, a publisher doesn't belong in the author field but there you go.

The new api does seem to indicate whether or not the contributor is a person, here:

"contributor": {
   "creators": [
     {
       "firstName": {
         "text": "Seth"
       },
       "secondName": {
         "text": "Grahame-Smith"
       },
       "type": "person",
       "creatorNotes": [
         "author."
       ],
       "relators": [
         {
           "term": "Author.",
           "alternateTerm": "aut"
         }
       ]
     },
     {
       "firstName": {
         "text": "Jane"
       },
       "secondName": {
         "text": "Austen"
       },
       "type": "person",
       "creatorNotes": [
         "1775-1817. http://rdaregistry.info/Elements/w/P10197"
       ],
       "relators": [
         {
           "alternateTerm": "http://rdaregistry.info/Elements/w/P10197"
         }
       ]
     }
   ],

So maybe we just need to wait for that? Or merge this now, and live with it until then, because we don't have access to that api yet?
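
If we do get access, consuming that structure could look roughly like the sketch below. It relies only on the keys visible in the sample above (`contributor`, `creators`, `firstName`, `secondName`, `type`); the function name `creators_to_names` is hypothetical, and what non-person entries look like is an assumption, since the sample only shows `"type": "person"`.

```python
def creators_to_names(record: dict):
    """Return (first, last) pairs for creators flagged as persons.

    Assumption: entries that are not people carry a 'type' other than
    'person'; the sample in this task does not show such an entry.
    """
    creators = record.get("contributor", {}).get("creators", [])
    names = []
    for creator in creators:
        if creator.get("type") != "person":
            continue  # skip publishers and other corporate contributors
        first = creator.get("firstName", {}).get("text", "")
        last = creator.get("secondName", {}).get("text", "")
        if first or last:
            names.append((first, last))
    return names


# Trimmed copy of the sample quoted above.
sample = {
    "contributor": {
        "creators": [
            {"firstName": {"text": "Seth"},
             "secondName": {"text": "Grahame-Smith"},
             "type": "person"},
            {"firstName": {"text": "Jane"},
             "secondName": {"text": "Austen"},
             "type": "person"},
        ]
    }
}

print(creators_to_names(sample))  # [('Seth', 'Grahame-Smith'), ('Jane', 'Austen')]
```

Because the names arrive already separated, no comma parsing is needed, and the dates in `creatorNotes` never reach the name fields.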

Change 632732 merged by jenkins-bot:
[mediawiki/services/citoid@master] Divide authors by first name and last name

https://gerrit.wikimedia.org/r/632732

This is now deployed. I have checked RefToolbar and VE citoid and it seems to work okay, but if there are any other tools this causes problems for, the task can be reopened.