
Stale isbn hyphenation data
Open, Needs Triage · Public

Description

In https://gerrit.wikimedia.org/r/#/c/283940/ there is a discussion about the staleness of the ISBN metadata used by Pywikibot.
https://gerrit.wikimedia.org/r/#/c/209176/ proposes adding a copy of the ISBN metadata to the Pywikibot library, so that it can be updated by the core Pywikibot team.

Cosmetic changes currently uses the external library stdnum.isbn if it is installed (a result of T89996: Add isbn package dependency), and otherwise falls back to the ISBN implementation in scripts.isbn.
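
The optional-dependency pattern described above can be sketched roughly as follows. This is not the actual Pywikibot code: `hyphenate_isbn` and its `fallback` parameter are hypothetical names, and the fallback callable merely stands in for the implementation in scripts.isbn. The only library call assumed is `stdnum.isbn.format`.

```python
try:
    # Optional dependency: prefer python-stdnum when it is installed.
    from stdnum import isbn as stdnum_isbn
except ImportError:
    stdnum_isbn = None


def hyphenate_isbn(number, fallback):
    """Hyphenate with stdnum if available, else with the fallback callable.

    `fallback` is a hypothetical stand-in for the scripts.isbn code path.
    """
    if stdnum_isbn is not None:
        return stdnum_isbn.format(number)
    return fallback(number)
```

Either way, the result is only as current as the range data bundled with whichever implementation ends up being used.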

Specifically, the problem identified is that stdnum v1.0, released 2014-10-19, incorrectly hyphenates some German ISBNs, and some German websites fail when given an incorrectly formatted ISBN.

e.g. https://portal.dnb.de/opac.htm?query=978-3-95539-063-1&method=simpleSearch&cqlMode=true is ok
but https://portal.dnb.de/opac.htm?query=978-3-9553906-3-1&method=simpleSearch&cqlMode=true fails.

The known German stale data problem was fixed in v1.1 released 2015-04-27 with an updated dataset.
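
To illustrate why stale range data yields the failing form, here is a minimal sketch. The `hyphenate` helper and the two rule tables are hypothetical. The ranges are chosen only to reproduce the two hyphenations quoted above and are not copied from the International ISBN Agency data or from stdnum.

```python
# Each rule: (EAN prefix, group, low, high, registrant length).
# low/high are compared against the first 7 digits after the group.
# Both tables are illustrative assumptions, not real range data.
STALE_RULES = [("978", "3", 9500000, 9999999, 7)]
FRESH_RULES = [("978", "3", 9540000, 9699999, 5)]


def hyphenate(isbn13, rules):
    """Hyphenate a 13-digit ISBN using the given range rules."""
    digits = isbn13.replace("-", "")
    for ean, group, low, high, length in rules:
        prefix = ean + group
        if not digits.startswith(prefix):
            continue
        body = digits[len(prefix):-1]  # registrant + publication digits
        key = int(body[:7].ljust(7, "0"))
        if low <= key <= high:
            return "-".join(
                [ean, group, body[:length], body[length:], digits[-1]]
            )
    return digits  # no matching rule: leave unhyphenated


print(hyphenate("9783955390631", STALE_RULES))
print(hyphenate("9783955390631", FRESH_RULES))
```

The same digits split differently depending only on which rule table is installed, which is why the age of the dataset matters more than the hyphenation code itself.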

T85240 was where I first did an analysis of the available libraries.

The following packages ship their data in the release, as a static dataset:
https://pypi.python.org/pypi/isbn_hyphenate - https://github.com/TorKlingberg/isbn_hyphenate/commits/master/isbn_hyphenate/isbn_lengthmaps.py
https://pypi.python.org/pypi/isbnid - https://github.com/nekobcn/isbnid/commits/master/data
https://pypi.python.org/pypi/isbnlib - https://github.com/xlcnd/isbnlib/commits/master/isbnlib/_data/data4mask.py

Another package not previously seriously considered is pyisbn, as it wasn't very actively maintained and it does not handle hyphenation at all; however, it omits hyphenation for a very good reason. There is an upstream ticket from 2011 for adding hyphenation: https://github.com/JNRowe/pyisbn/issues/3 . I've added a comment to it now. The primary issue is that while the International ISBN Agency provides a machine-readable version of its data, it does not provide a license to redistribute the information. They also cite staleness of the data as a problem.

fwiw, its credits include @notconfusing, who has one commit.

Event Timeline

jayvdb created this task. Apr 18 2016, 2:55 PM
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Apr 18 2016, 2:55 PM

Another serious concern regarding staleness is that bots might war with each other over the correct hyphenation.

We might be able to prevent staleness being a serious problem by having a safe mode where existing hyphenation is not overwritten if the installed library data is stale.

This could be determined by doing a HEAD request on the source data URL, and comparing the result with a date-sensitive part of the external library. This would need to be done even if the source data was packaged as part of pywikibot; otherwise we've not solved the problem, as someone could be running an old version of pywikibot with old data.
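
A rough sketch of that check and of the safe mode, using only the standard library. All function names here are hypothetical, the source data URL is left as a parameter, and it is an assumption that the server sends a `Last-Modified` header and that the library exposes some date for its bundled data. The dates in the example are placeholders.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_last_modified(url, timeout=10):
    """Return the Last-Modified time of `url` via a HEAD request, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        header = response.headers.get("Last-Modified")
    return parsedate_to_datetime(header) if header else None


def data_is_stale(bundled_date, upstream_date):
    """True when upstream data is newer than the bundled data, or unknown."""
    if upstream_date is None:
        return True  # freshness cannot be confirmed, so stay conservative
    return upstream_date > bundled_date


def choose_isbn(existing, rehyphenated, stale):
    """Safe mode: keep existing hyphenation when the data may be stale."""
    if stale and "-" in existing:
        return existing
    return rehyphenated


# Offline example; both dates are illustrative placeholders.
bundled = datetime(2014, 10, 19, tzinfo=timezone.utc)
upstream = parsedate_to_datetime("Mon, 27 Apr 2015 00:00:00 GMT")
stale = data_is_stale(bundled, upstream)
print(choose_isbn("978-3-95539-063-1", "978-3-9553906-3-1", stale))
```

Treating an unknown upstream date as stale is a deliberate choice in this sketch: a bot that cannot confirm its data is current leaves existing hyphenation alone, which also reduces the risk of bots warring over it.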

If none of the external libraries are interested in solving the staleness problem properly, we could fork an existing library to add staleness detection.

https://github.com/JNRowe/pyisbn/issues/3 has been closed as WONTFIX.
So we need to look at working with one of the other libraries.

Change 209176 had a related patch set uploaded (by Xqt):
[bugfix] Use RangeMessage map xml file from International ISBN Agency

https://gerrit.wikimedia.org/r/209176

Change 209176 abandoned by Multichill:
[bugfix] Use RangeMessage map xml file from International ISBN Agency

Reason:
No response. This can always be re-opened if you plan to work on it again.

https://gerrit.wikimedia.org/r/209176

Xqt moved this task from Backlog to Needs Review on the Pywikibot board. Feb 3 2019, 11:23 AM

Change 565772 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [isbn] Always use the newest release of python-stdnum

https://gerrit.wikimedia.org/r/565772

Change 565784 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Remove isbn library of isbn.py in favour of stdnum

https://gerrit.wikimedia.org/r/565784

Change 565772 merged by jenkins-bot:
[pywikibot/core@master] [isbn] Always use the newest release of python-stdnum

https://gerrit.wikimedia.org/r/565772

Change 565784 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Remove isbn library of isbn.py in favour of stdnum

https://gerrit.wikimedia.org/r/565784

What is missing here?

Xqt added a comment. Jun 11 2020, 5:16 AM

stdnum is not as up to date as the others, I guess.