wrong author listed and wrong first/last name for the one author listed using ISBN lookup
Closed, Resolved (Public)

Authored By

	jeremyb
	Mar 19 2017, 12:07 AM

Description

citoid has wrong author listed and the wrong first/last name for the one author listed.

http://web.archive.org/web/20170318230352/https://www.worldcat.org/title/black-artists-of-the-new-generation/oclc/886799569

vs.

https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/0396074340

WorldCat's first listed author doesn't appear in the citoid output at all. The second listed author in this case happens to be the foreword author, not an author of the work proper.

(via https://en.wikipedia.org/wiki/User:Versary19/sandbox )

"It is pulling dates into last name, putting the publisher as "other" and I don't know what else. This was reported by the Amon Carter Museum -- at first I thought it was just a weird record because you can get those but I had the same results with others. Also missing publication date."

Error occurs with:
9780995555563
0511040938
9780838916322

Details

	Subject	Repo	Branch	Lines +/-
	Divide authors by first name and last name	mediawiki/services/citoid	master	+537 -67
	Don't split authors by default	mediawiki/services/citoid	master	+42 -42

Related Objects

Status	Assigned	Task
Open	None	T165321 New param/pattern in service for requesting a single citation based on a unique identifier
Resolved	Mvolz	T198567 Allow arbitrary search strings in citoid
Resolved	Mvolz	T179123 Use crossref to search for human-readable citations copy-pasted from a bibliography in a PDF
Resolved	Mvolz	T162357 Add support for worldcat search api xml results
Resolved	Mvolz	T160845 wrong author listed and wrong first/last name for the one author listed using ISBN lookup
Declined	None	T264083 Switch to WorldCat Search v2 api

Event Timeline

jeremyb created this task. Mar 19 2017, 12:07 AM

Restricted Application added projects: VisualEditor, Internet-Archive. Mar 19 2017, 12:07 AM

Restricted Application added a subscriber: Aklapper.

Mvolz moved this task from Backlog to Zotero on the Citoid board. Mar 20 2017, 10:53 AM

Jdforrester-WMF moved this task from To Triage to External and Administrivia on the VisualEditor board. Mar 21 2017, 7:11 PM

An update about this issue.
Using the newly released ISBN citation feature, we discovered that the first name/last name problem is still around.
How to reproduce:

insert the ISBN "9780123850591" (Computer networks by Larry L Peterson and Bruce S Davie)
the template is getting "S." as last name and "Davie, Bruce" as first name

https://en.wikipedia.org/api/rest_v1/data/citation/mediawiki/9780123850591

Exporting the citation from worldcat https://www.worldcat.org/title/computer-networks-a-systems-approach/oclc/781227361 generates correct data, the authors in the download file are:

Peterson, Larry L.
Davie, Bruce S.
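
For illustration only: one way the symptom above could arise is a splitter that assumes "First Last" order and cuts at the last space. This is an assumption about the failure mode, not something confirmed in this task, and `naive_split` is a hypothetical helper rather than citoid code.

```python
def naive_split(name: str) -> tuple[str, str]:
    """Hypothetical splitter: everything before the last space is treated as
    the first name and the final token as the last name ("First Last" order)."""
    first, _, last = name.strip().rpartition(" ")
    return first, last


# WorldCat exports this name in "Last, First" order, so the naive split
# assigns the pieces to the wrong fields.
first, last = naive_split("Davie, Bruce S.")
print(f"first={first!r} last={last!r}")  # first='Davie, Bruce' last='S.'
```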

@Mvolz this isn't even triaged, and yet it seems to affect every wiki, making ISBN ref generation a bit lame?

In T160845#3788695, @Elitre wrote:

@Mvolz this isn't even triaged, and yet it seems to affect every wiki, making ISBN ref generation a bit lame?

I looked into this extensively when I did T155161, hoping that adding another data format would fix the problem.

Unfortunately the data is very, very inconsistent, and actually worse in some ways in the MarcXML than the DublinCore format. They simply don't give us the data in a consistent and structured way from the API. From record to record it's often not even in the same field.

I agree the data looks okay from the worldcat website. I wonder if they have an internal data format that we don't have access to. Unfortunately we're only able to access their data in MarcXML and DublinCore and both of those formats have some significant flaws.

Maybe with some more sophisticated natural language processing we could do a little better but probably not 100% (particularly concerning the foreword author issue) :/

I can look into it again at some point.

Mvolz triaged this task as Medium priority. Nov 28 2017, 3:52 PM

Mvolz removed a project: Internet-Archive.

In T160845#3792792, @Mvolz wrote:

I agree the data looks okay from the worldcat website. I wonder if they have an internal data format that we don't have access to. Unfortunately we're only able to access their data in MarcXML and DublinCore and both of those formats have some significant flaws.

Thanks for the detailed reply. We do have contacts there, though? Can we maybe ask them what's going on? :)

Dalba awarded a token. Apr 14 2018, 8:12 AM

Dalba subscribed.

Mvolz added a parent task: T162357: Add support for worldcat search api xml results. Apr 20 2018, 8:07 AM

Mvolz merged a task: T193567: Bug report: Misplaced or improper import/scraping. May 3 2018, 9:04 AM

Mvolz added subscribers: Ocaasi_WMF, Merrilee, Jdforrester-WMF.

Hi, this seems to be working differently than when launched. For example the ISBN 9788611177434 which is used in the WMF blog post https://blog.wikimedia.org/2017/05/11/wikimedia-oclc-partnership/ no longer works and in fact throws up an error. Have you worked with Karen Combs at OCLC? If not I can put you in touch.

Hi, that ISBN works for me at en.wp. Error messages can happen - you can fix them before or after saving. https://en.wikipedia.org/wiki/Help:CS1_errors#bad_date is the guide to fix the one that this specific source is giving me.

When I go back to edit the resulting citation it seems to have saved it as a text block and not as a book citation with the various fields?

<s>Possibly an issue with that specific ISBN? A different one generated https://en.wikipedia.org/w/index.php?title=User%3AElitre_%28WMF%29%2Fsandbox&type=revision&diff=839628277&oldid=839601708 for me. </s>

Thankfully I have smarter colleagues who figure stuff out for me.

@Merrilee, that ISBN is for an audio recording, which (according to https://en.wikipedia.org/wiki/MediaWiki:Citoid-template-type-map.json ) is supposed to use the {{citation}} template rather than the {{cite book}} template. (It's strange that the blog post mentions this ISBN in text, but shows a different one in the image.)

The date error is generated by the local CS1 template, because "cop. 2007" is not a date format that it recognizes.

Mvolz claimed this task. Jul 1 2018, 10:26 AM

Mvolz updated the task description.

Mvolz moved this task from Zotero to Service on the Citoid board. Jan 4 2019, 11:03 AM

Mvolz mentioned this in T214802: Enable and use or merge results from zotero ISBN search to improve ISBN results. Jan 28 2019, 1:44 PM

Izno subscribed. Mar 15 2019, 5:20 PM

Change 497315 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] Don't split authors by default

https://gerrit.wikimedia.org/r/497315

gerritbot added a project: Patch-For-Review. Mar 18 2019, 2:55 PM

Change 497315 merged by jenkins-bot:
[mediawiki/services/citoid@master] Don't split authors by default

https://gerrit.wikimedia.org/r/497315

I've now deployed a fix that no longer splits the authors, and just puts them all in the "last" field. So in terms of how it's rendered with most citation templates, it looks a lot better. However this is a bit hacky, so I'm leaving this open for right now. It will continue to leave in parts of the author name that we don't necessarily want, like the birth date in parens.

Ladsgroup removed a project: Patch-For-Review. May 28 2019, 3:37 PM

Sophivorus mentioned this in T214061: Button to switch first and last name. Oct 2 2019, 4:59 PM

I've now deployed a fix that no longer splits the authors, and just puts them all in the "last" field. So in terms of how it's rendered with most citation templates, it looks a lot better. However this is a bit hacky, so I'm leaving this open for right now. It will continue to leave in parts of the author name that we don't necessarily want, like the birth date in parens.

Unfortunately, this entirely broke all author data for ISBN look-ups for RefToolbar, which excludes any author data that has a comma in it (under the assumption that this indicates the data is malformed). It took me a long time to track down where this regression came from. In the future, could you please post a notice at https://en.wikipedia.org/wiki/Wikipedia_talk:RefToolbar any time you make a significant change to what Citoid returns? Thanks!
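
A rough sketch of the interaction described above, assuming a consumer that simply drops any author whose fields contain a comma. The function names and the creator shape are hypothetical; this is neither the citoid nor the RefToolbar code, and the sample strings are inferred from the rendered citations quoted in this task.

```python
def unsplit_creator(raw_name: str) -> dict[str, str]:
    """Mimic the "don't split" behaviour: the whole string goes in 'last'."""
    return {"first": "", "last": raw_name}


def drop_authors_with_commas(creators: list[dict[str, str]]) -> list[dict[str, str]]:
    """Hypothetical consumer-side filter: a comma is taken as a sign of
    malformed data, so any author containing one is excluded."""
    return [c for c in creators if "," not in c["first"] and "," not in c["last"]]


raw_names = ("Shird, Kevin,", "King, Coretta Scott, 1927-2006,")
creators = [unsplit_creator(name) for name in raw_names]

# Every unsplit WorldCat-style name contains a comma, so nothing survives.
print(drop_authors_with_commas(creators))  # []
```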

FWIW, https://gerrit.wikimedia.org/r/497315 doesn't seem like a great solution. You end up getting weird punctuation in the citations, like:

Shird, Kevin,. The colored waiting room : empowering the original and the new civil rights movements. Malden, Nelson,. New York. ISBN 978-1-948062-01-5. OCLC 1029842051.

King, Coretta Scott, 1927-2006,. My life, my love, my legacy. Reynolds, Barbara A., (First edition ed.). New York. ISBN 978-1-62779-598-2. OCLC 950430557.

Is it time to switch over to Zotero for all ISBN look-ups, maybe (T214802)? Is there any downside to just switching over to Zotero entirely? The output from Worldcat seems pretty abysmal.

kaldari renamed this task from "wrong author listed and wrong first/last name for the one author listed" to "wrong author listed and wrong first/last name for the one author listed using ISBN lookup". Aug 3 2020, 7:42 PM

In T160845#6357203, @kaldari wrote:

I've now deployed a fix that no longer splits the authors, and just puts them all in the "last" field. So in terms of how it's rendered with most citation templates, it looks a lot better. However this is a bit hacky, so I'm leaving this open for right now. It will continue to leave in parts of the author name that we don't necessarily want, like the birth date in parens.

Unfortunately, this entirely broke all author data for ISBN look-ups for RefToolbar, which excludes any author data that has a comma in it (under the assumption that this indicates the data is malformed). It took me a long time to track down where this regression came from. In the future, could you please post a notice at https://en.wikipedia.org/wiki/Wikipedia_talk:RefToolbar any time you make a significant change to what Citoid returns? Thanks!

Sorry - didn't occur to me! @Whatamidoing-WMF - do we have any sort of process on communicating en wiki stuff about citoid documented somewhere? Is it mostly just the VE boards?

In T160845#6357392, @kaldari wrote:

FWIW, https://gerrit.wikimedia.org/r/497315 doesn't seem like a great solution. You end up getting weird punctuation in the citations, like:

Shird, Kevin,. The colored waiting room : empowering the original and the new civil rights movements. Malden, Nelson,. New York. ISBN 978-1-948062-01-5. OCLC 1029842051.

King, Coretta Scott, 1927-2006,. My life, my love, my legacy. Reynolds, Barbara A., (First edition ed.). New York. ISBN 978-1-62779-598-2. OCLC 950430557.

Is it time to switch over to Zotero for all ISBN look-ups, maybe (T214802)? Is there any downside to just switching over to Zotero entirely? The output from Worldcat seems pretty abysmal.

Yeah T214802 is probably worth looking into.

do we have any sort of process on communicating en wiki stuff about citoid

Yes. The process is:

Significant technical changes should be announced in Tech News – just go to https://meta.wikimedia.org/wiki/Tech/News/Next, and add a sentence or two with a link to the Phab task (Please do this now! It's not too late to be useful!), and
People who use other people's services or tools (whether WMF or otherwise) are required to keep track of changes to their dependencies themselves. The English Wikipedia is not the only place that uses citoid, and RefToolbar is not the only enwiki tool to use it. People can start using citoid without giving us any notice. It is generally bad to privilege a single tool at a single, well-connected community.

On the other question, about switching "to Zotero", Zotero appears to use the Library of Congress (~39 million catalogued books) first, and WorldCat (~450 million records) when that fails. I didn't happen to grow up with French spacing, and I did grow up with title case for book titles, so I've often wanted to "correct" a few things in the WorldCat database, but I'm not sure that a worldwide service would find all of the "weird" punctuation as weird as I do, and they might find some of the Library of Congress's American-style punctuation and capitalization to be wrong. In the end, whatever source we use, some people won't like it.

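@Whatamidoing-WMF - It looks like the Library of Congress is using the same "weird" punctuation and capitalization as WorldCat:

A world undone : the story of the Great War, 1914-1918

How we get free : black feminism and the Combahee River Collective

Ada, the enchantress of numbers : a selection from the letters of Lord Byron's daughter and her description of the first computer

Maybe it's some kind of library database standard or maybe the LOC is reusing WorldCat data for their own books.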
@Mvolz - Can you report the titles that you get for the following ISBN numbers via Zotero's API: 9781608468553, 9780553803549, 9780912647180. Just want to confirm that they are similar to those returned by the LOC search.

In T160845#6459280, @kaldari wrote:

@Whatamidoing-WMF - It looks like the Library of Congress is using the same "weird" punctuation and capitalization as WorldCat:

A world undone : the story of the Great War, 1914-1918

How we get free : black feminism and the Combahee River Collective

Ada, the enchantress of numbers : a selection from the letters of Lord Byron's daughter and her description of the first computer

Maybe it's some kind of library database standard or maybe the LOC is reusing WorldCat data for their own books.
@Mvolz - Can you report the titles that you get for the following ISBN numbers via Zotero's API: 9781608468553, 9780553803549, 9780912647180. Just want to confirm that they are similar to those returned by the LOC search.

Yup, same titles!

@kaldari re this change in general, how would you feel if I reverted it and just improved the results using the code you use on refToolBar?

This change was pretty hacky and not really best practice so I think maybe undoing it is the best option.

@Mvolz - I would be fine with that, but there are a couple caveats:

The refToolbar code isn't foolproof, and in fact, there is no foolproof way to handle the WorldCat author data since its formatting is just too inconsistent.
It currently handles "Jr." and "Sr.", but not suffixes or prefixes that may exist in other languages.

If you want to use it, the most recent version of the code is at https://github.com/alexz-enwp/reftoolbar/blob/master/lookup.php#L139.

In the meantime, I'll see if we can ask our contact at WorldCat if there are any better options.

In T160845#6500357, @kaldari wrote:

@Mvolz - I would be fine with that, but there are a couple caveats:

The refToolbar code isn't foolproof, and in fact, there is no foolproof way to handle the WorldCat author data since its formatting is just too inconsistent.

It currently handles "Jr." and "Sr.", but not suffixes or prefixes that may exist in other languages.

If you want to use it, the most recent version of the code is at https://github.com/alexz-enwp/reftoolbar/blob/master/lookup.php#L139.

In the meantime, I'll see if we can ask our contact at WorldCat if there are any better options.

It looks like they've released a v2 of their search api since I last looked that returns JSON: https://developer.api.oclc.org/wcv2 (instead of MarcXML or Dublin core, which is what we use now).

I think switching to that is a good idea, although even just using their sample input there's still a bit of junk left over in the authors field (a stray period!) But probably better than doing that processing ourselves!

It looks like they've released a v2 of their search api since I last looked that returns JSON: https://developer.api.oclc.org/wcv2 (instead of MarcXML or Dublin core, which is what we use now).

I think switching to that is a good idea, although even just using their sample input there's still a bit of junk left over in the authors field (a stray period!) But probably better than doing that processing ourselves!

Note this doesn't really solve the whole issue because it's only available from their open search api. The metadata api (which we use for requests for isbns) still only returns MarcXML and DublinCore. Maybe we could nudge them into offering JSON for their https://developer.api.oclc.org/wc-metadata too?

Change 632732 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] [WIP] Divide authors by first name and last name

https://gerrit.wikimedia.org/r/632732

gerritbot added a project: Patch-For-Review. Oct 7 2020, 2:38 PM

In T160845#6525484, @gerritbot wrote:

Change 632732 had a related patch set uploaded (by Mvolz; owner: Mvolz):
[mediawiki/services/citoid@master] [WIP] Divide authors by first name and last name

https://gerrit.wikimedia.org/r/632732

@kaldari
I've implemented the reftoolbar solution in this change.
Note this unfixes T203361 and T218125.

Also we get some bad results still. For example, DK Publishing, Inc. becomes FirstName: Inc, Last Name: DK Publishing, and in a formatted citation on en wiki the author will be Inc DK Publishing. Yes, a publisher doesn't belong in the author field but there you go.
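
To make the trade-off concrete, here is a minimal sketch of a comma-based "Last, First" splitter. It is not the RefToolbar code and not the merged citoid change; `tidy`, `split_author`, the rule for trailing periods, and the placement of the suffix after the last name are all assumptions made for illustration. Only the "Jr."/"Sr." suffixes mentioned in this task are handled.

```python
import re

SUFFIXES = {"Jr.", "Sr."}  # only the two suffixes mentioned in this task


def tidy(part: str) -> str:
    """Trim whitespace and trailing commas; drop a trailing period unless it
    follows a single-letter initial (keeps "S." but turns "Inc." into "Inc")."""
    part = part.strip().rstrip(",").strip()
    if part.endswith(".") and not re.search(r"(^|\s)\w\.$", part):
        part = part[:-1]
    return part


def split_author(raw: str) -> dict[str, str]:
    """Split a "Last, First[, suffix][, dates]" string into first/last."""
    parts = [p.strip() for p in raw.split(",") if p.strip()]
    # Drop date-like pieces such as "1927-2006" or an open-ended "1950-".
    parts = [p for p in parts if not re.fullmatch(r"\d{4}-(\d{4})?\.?", p)]
    suffix = next((p for p in parts if p in SUFFIXES), "")
    parts = [p for p in parts if p not in SUFFIXES]
    last = tidy(parts[0]) if parts else ""
    first = tidy(" ".join(parts[1:]))
    if suffix:
        last = f"{last} {suffix}"  # assumed placement of the suffix
    return {"first": first, "last": last}


print(split_author("King, Coretta Scott, 1927-2006,"))
print(split_author("DK Publishing, Inc."))  # {'first': 'Inc', 'last': 'DK Publishing'}
```

The second call reproduces the publisher problem: nothing in the string itself says that "DK Publishing, Inc." is not a person, so a purely textual split cannot avoid it.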

The new api does seem to indicate whether or not the contributor is a person, here:

"contributor": {
   "creators": [
     {
       "firstName": {
         "text": "Seth"
       },
       "secondName": {
         "text": "Grahame-Smith"
       },
       "type": "person",
       "creatorNotes": [
         "author."
       ],
       "relators": [
         {
           "term": "Author.",
           "alternateTerm": "aut"
         }
       ]
     },
     {
       "firstName": {
         "text": "Jane"
       },
       "secondName": {
         "text": "Austen"
       },
       "type": "person",
       "creatorNotes": [
         "1775-1817. http://rdaregistry.info/Elements/w/P10197"
       ],
       "relators": [
         {
           "alternateTerm": "http://rdaregistry.info/Elements/w/P10197"
         }
       ]
     }
   ],

So maybe we just need to wait for that? Or merge this now, and live with it until then, because we don't have access to that api yet?
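
As a sketch of how that structure could be consumed, assuming the field names shown in the sample above (`contributor.creators[]`, `firstName.text`, `secondName.text`, `type`): the helper `person_creators` is hypothetical, the sample is trimmed and closed off so that it parses, and the third entry is a made-up placeholder for a non-person record, not a value taken from the API.

```python
import json

# Trimmed copy of the sample above. The last entry is a placeholder added
# for contrast; its "type" value is invented, not taken from the API.
SAMPLE = """
{
  "contributor": {
    "creators": [
      {"firstName": {"text": "Seth"}, "secondName": {"text": "Grahame-Smith"}, "type": "person"},
      {"firstName": {"text": "Jane"}, "secondName": {"text": "Austen"}, "type": "person"},
      {"type": "hypothetical-non-person"}
    ]
  }
}
"""


def person_creators(record: dict) -> list[dict[str, str]]:
    """Return first/last pairs for creators whose type is "person"."""
    creators = record.get("contributor", {}).get("creators", [])
    people = []
    for creator in creators:
        if creator.get("type") != "person":
            continue  # skip anything not flagged as a person
        people.append({
            "first": creator.get("firstName", {}).get("text", ""),
            "last": creator.get("secondName", {}).get("text", ""),
        })
    return people


print(person_creators(json.loads(SAMPLE)))
```

With structured names like these, no string splitting is needed, and entries such as publishers could be left out of the author fields.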

Mvolz changed the status of subtask T264083: Switch to WorldCat Search v2 api from Open to Stalled. Nov 10 2020, 10:33 AM

Change 632732 merged by jenkins-bot:
[mediawiki/services/citoid@master] Divide authors by first name and last name

https://gerrit.wikimedia.org/r/632732

Maintenance_bot removed a project: Patch-For-Review. Nov 23 2020, 5:10 PM

Mvolz closed this task as Resolved. Feb 11 2021, 11:50 AM

Restricted Application added a project: User-Ryasmeen. Feb 11 2021, 11:50 AM

This is now deployed. I have checked RefToolbar and VE citoid, and it seems to work okay, but if this causes problems for any other tools, the task can be reopened.

PerfektesChaos mentioned this in T277409: Zotero is duplicating name of author in some requests. Mar 14 2021, 2:34 PM

Mvolz changed the status of subtask T264083: Switch to WorldCat Search v2 api from Stalled to Open. Apr 24 2023, 2:00 PM

Mvolz changed the status of subtask T264083: Switch to WorldCat Search v2 api from Open to Stalled. May 11 2023, 12:33 PM

Mvolz closed subtask T264083: Switch to WorldCat Search v2 api as Declined. Jun 12 2023, 12:21 PM
