Stale isbn hyphenation data
Open, Needs Triage, Public
Assigned To

None

Authored By

	jayvdb
	Apr 18 2016, 2:55 PM

Description

https://gerrit.wikimedia.org/r/#/c/283940/ has a discussion about the staleness of the ISBN metadata used by pywikibot.
https://gerrit.wikimedia.org/r/#/c/209176/ proposes adding a copy of the ISBN metadata to the pywikibot library, so that it can be updated by the core Pywikibot team.

As a result of T89996: Add isbn package dependency, cosmetic changes currently uses the external library stdnum.isbn if it is installed, and otherwise falls back to the ISBN implementation in scripts.isbn.
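The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the actual pywikibot code: the helper name `load_isbn_backend` is hypothetical, and only the module names `stdnum.isbn` and `scripts.isbn` come from this task.

```python
import importlib


def load_isbn_backend(candidates=("stdnum.isbn", "scripts.isbn")):
    """Return (name, module) for the first importable candidate.

    Hypothetical helper: tries each module name in order and skips
    those that are not installed. Returns (None, None) if none load.
    """
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Not installed (or a parent package is missing): try the next one.
            continue
    return None, None
```

Whichever module wins the lookup also determines which range dataset is used, which is why the staleness of the external library matters here.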

Specifically, the problem identified is that stdnum v1.0, released 2014-10-19, incorrectly hyphenates some German ISBNs, and some German websites fail when given an incorrectly hyphenated ISBN.

e.g. https://portal.dnb.de/opac.htm?query=978-3-95539-063-1&method=simpleSearch&cqlMode=true is ok
but https://portal.dnb.de/opac.htm?query=978-3-9553906-3-1&method=simpleSearch&cqlMode=true fails.

The known German stale data problem was fixed in v1.1 released 2015-04-27 with an updated dataset.
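To show why a stale dataset produces the wrong hyphenation, here is a small self-contained sketch. The two range tables are invented for illustration so that they reproduce the two hyphenations quoted above. They are not the real International ISBN Agency data, and the function is not stdnum's implementation.

```python
# Illustrative tables only, NOT real agency data. Each entry is
# (low, high, registrant_length), matched against the first seven
# digits that follow the "978-3" prefix.
STALE_RANGES = [(9500000, 9999999, 7)]
UPDATED_RANGES = [
    (9500000, 9539999, 7),
    (9540000, 9699999, 5),
    (9700000, 9999999, 7),
]


def hyphenate_978_3(isbn13, ranges):
    """Hyphenate a 978-3 ISBN-13 using the given range table."""
    digits = isbn13.replace("-", "")
    if len(digits) != 13 or not digits.startswith("9783"):
        raise ValueError("expected a 13-digit ISBN starting with 978-3")
    body = digits[4:12]   # registrant + publication digits
    key = int(body[:7])   # value looked up in the range table
    for low, high, length in ranges:
        if low <= key <= high:
            parts = ["978", "3", body[:length], body[length:], digits[12]]
            return "-".join(parts)
    raise ValueError("no matching range")


print(hyphenate_978_3("9783955390631", STALE_RANGES))    # 978-3-9553906-3-1
print(hyphenate_978_3("9783955390631", UPDATED_RANGES))  # 978-3-95539-063-1
```

The same digits get a different registrant length depending only on which table is installed, so two bots with different data versions would disagree on the result.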

T85240 was where I first did an analysis of the available libraries.

The following libraries package their data into the release as a static dataset:
https://pypi.python.org/pypi/isbn_hyphenate - https://github.com/TorKlingberg/isbn_hyphenate/commits/master/isbn_hyphenate/isbn_lengthmaps.py
https://pypi.python.org/pypi/isbnid - https://github.com/nekobcn/isbnid/commits/master/data
https://pypi.python.org/pypi/isbnlib - https://github.com/xlcnd/isbnlib/commits/master/isbnlib/_data/data4mask.py

Another package, not seriously considered previously, is pyisbn: it wasn't very actively maintained, and it does not handle hyphenation at all. However, there is a very good reason it does not. There is an upstream ticket for adding hyphenation, from 2011: https://github.com/JNRowe/pyisbn/issues/3. I've added a comment to it now. The primary issue is that while the International ISBN Agency provides a machine-readable version of their data, they do not provide a license to redistribute the information. They also cite staleness of the data as a problem.

fwiw, its credits include @notconfusing, who has one commit.

Details

Subject	Repo	Branch	Lines +/-
[bugfix] Remove isbn library of isbn.py in favour of stdnum	pywikibot/core	master	+55 -1 K
[isbn] Always use the newest release of python-stdnum	pywikibot/core	master	+2 -2
[bugfix] Use RangeMessage map xml file from International ISBN Agency	pywikibot/core	master	+6 K -1 K

Related Objects

		Status	Subtype	Assigned	Task
		Open		None	T132919 Stale isbn hyphenation data
		Resolved		Xqt	T243157 Drop support of isbn_hyphenate

Event Timeline

jayvdb created this task. Apr 18 2016, 2:55 PM

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Apr 18 2016, 2:55 PM

Another serious concern regarding staleness is that bots might war with each other over the correct hyphenation.

We might be able to prevent staleness being a serious problem by having a safe mode where existing hyphenation is not overwritten if the installed library data is stale.

Staleness could be determined by doing a HEAD request on the source data URL and comparing the result with a date-sensitive part of the external library. This would need to be done even if the source data were packaged as part of pywikibot; otherwise we've not solved the problem, as someone could be running an old version of pywikibot with old data.
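A rough sketch of that check, under several assumptions: the source data server returns a `Last-Modified` header, the installed library exposes the date of its bundled data in some form (passed in here as `bundled_date`), and a missing header is treated conservatively as stale. All three function names are hypothetical, and the URL is left as a parameter because this task does not name one.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_last_modified(url, timeout=10):
    """Send a HEAD request and return the Last-Modified header, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def is_stale(last_modified, bundled_date):
    """Return True if the remote data is newer than the bundled data.

    A missing header counts as stale (an assumption of this sketch),
    so that safe mode errs on the side of not overwriting.
    """
    if last_modified is None:
        return True
    remote = parsedate_to_datetime(last_modified)
    if remote.tzinfo is None:
        remote = remote.replace(tzinfo=timezone.utc)
    return remote.astimezone(timezone.utc).date() > bundled_date


def safe_hyphenation(existing, proposed, stale):
    """Safe mode: keep an already hyphenated ISBN when the data is stale."""
    if stale and "-" in existing:
        return existing
    return proposed
```

With this shape, a bot running old data would still add hyphens to bare ISBNs but would leave existing hyphenation alone, which should also limit bots warring over the correct form.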

If none of the external libraries are interested in solving the staleness problem properly, we could fork an existing library to add staleness detection.

https://github.com/JNRowe/pyisbn/issues/3 has been closed as WONTFIX.
So we need to look at working with one of the other libraries.

Change 209176 had a related patch set uploaded (by Xqt):
[bugfix] Use RangeMessage map xml file from International ISBN Agency

https://gerrit.wikimedia.org/r/209176

gerritbot added a project: Patch-For-Review. Oct 2 2016, 4:50 PM

Change 209176 abandoned by Multichill:
[bugfix] Use RangeMessage map xml file from International ISBN Agency

Reason:
No response. This can always be re-opened if you plan to work on it again.

https://gerrit.wikimedia.org/r/209176

Xqt moved this task from Backlog to Needs Review on the Pywikibot board. Feb 3 2019, 11:23 AM

Change 565772 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [isbn] Always use the newest release of python-stdnum

https://gerrit.wikimedia.org/r/565772

Xqt added a subtask: T243157: Drop support of isbn_hyphenate. Jan 19 2020, 4:12 PM

Change 565784 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Remove isbn library of isbn.py in favour of stdnum

https://gerrit.wikimedia.org/r/565784

Change 565772 merged by jenkins-bot:
[pywikibot/core@master] [isbn] Always use the newest release of python-stdnum

https://gerrit.wikimedia.org/r/565772

Xqt mentioned this in rPWBCf498990a9d02: [isbn] Always use the newest release of python-stdnum. Jan 20 2020, 1:37 AM

Xqt closed subtask T243157: Drop support of isbn_hyphenate as Resolved. Jan 25 2020, 7:33 PM

Change 565784 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Remove isbn library of isbn.py in favour of stdnum

https://gerrit.wikimedia.org/r/565784

Xqt mentioned this in rPWBCf22cfbf5e208: [bugfix] Remove isbn library of isbn.py in favour of stdnum. Jan 30 2020, 1:50 PM

What is missing here?

@Xqt?

stdnum is not as up-to-date as the others, I guess.

Pppery removed a project: Patch-For-Review. Mar 29 2023, 12:14 AM