Port djvutext.py to core
Closed, Resolved, Public

Assigned To

Authored By

	valhallasw
	May 4 2014, 8:34 PM

Description

Might be non-trivial due to dependency on a djvu reading program ('djvused'), part of djvulibre-bin
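To illustrate the dependency, here is a minimal sketch of calling `djvused` from Python. It is not the code that was eventually merged. The helper names `djvused_command` and `run_djvused` are hypothetical, and it assumes the `djvused` binary from djvulibre-bin is on the PATH.

```python
import subprocess


def djvused_command(djvu_path, page=None):
    """Build the argument list for a djvused call (hypothetical helper).

    With page=None the command asks for the page count ('n');
    otherwise it selects one page and prints its plain text layer.
    """
    if page is None:
        script = "n"
    else:
        script = "select {}; print-pure-txt".format(int(page))
    return ["djvused", "-e", script, djvu_path]


def run_djvused(djvu_path, page=None):
    """Run djvused and return its stdout as text.

    Raises FileNotFoundError if djvused is not installed, and
    subprocess.CalledProcessError if djvused exits with an error.
    """
    result = subprocess.run(
        djvused_command(djvu_path, page),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Keeping the command construction separate from the subprocess call means the former can be tested on machines where djvulibre-bin is not installed, which is the situation the description worries about.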

Version: core-(2.0)
Severity: enhancement

Details

Reference: bz64853

	Subject	Repo	Branch	Lines +/-
	Add djvulibre-bin to travis apt package list	pywikibot/core	master	+1 -0
	Added DjVuFile class and djvutext.py in core	pywikibot/core	master	+390 -0


Related Objects

Status	Subtype	Assigned	Task
Invalid		None	T72936 Important tasks to be solved (tracking)
Open	Feature	None	T57880 Functionality existing in compat but missing from core (tracking)
Resolved		Mpaa	T66853 Port djvutext.py to core

Event Timeline

• bzimport raised the priority of this task to High. Nov 22 2014, 3:14 AM

• bzimport added a project: Pywikibot-Scripts.

• bzimport set Reference to bz64853.

• bzimport added a subscriber: Unknown Object (????).

valhallasw created this task. May 4 2014, 8:34 PM

I do not think this is really needed. With support of "preload" (see https://bugzilla.wikimedia.org/show_bug.cgi?id=58963), if an Index Page is present on wikisource, one can get page text with the API even if a page is not created yet. This is what I usually do on en.WS.
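As a rough sketch of that approach, the snippet below builds an API request for the preload text of a page that does not exist yet. The comment does not say which API parameters are involved; `prop=info` with `inprop=preload` is an assumption here, and the function name, wiki URL and page title are purely illustrative.

```python
from urllib.parse import urlencode


def preload_query_url(api_url, page_title):
    """Return a query URL asking for the preload text of one page.

    Assumption: the preload text is exposed through prop=info
    with inprop=preload; check the wiki's API help before relying on it.
    """
    params = {
        "action": "query",
        "prop": "info",
        "inprop": "preload",
        "titles": page_title,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


# Illustrative values only.
url = preload_query_url(
    "https://en.wikisource.org/w/api.php", "Page:Example.djvu/5"
)
```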

Change 132816 had a related patch set uploaded by Mpaa:
Bug 64853 - Port djvutext.py to core

https://gerrit.wikimedia.org/r/132816

https://gerrit.wikimedia.org/r/132816 has been merged.
But we don't have djvutext.py yet.

As I said above, I cannot see a use case. If someone has a djvu and wishes to upload the text to wikisource, they will first create an Index linked to that djvu.
Once that is done, they can fetch the text directly from the site with the patch above.

No dependencies, no errors, and pagination is handled by the Proofread extension.

Aklapper added a project: Pywikibot. Nov 27 2014, 4:15 PM

jayvdb lowered the priority of this task from High to Low. Dec 5 2014, 4:09 AM

jayvdb edited projects, added Pywikibot-compat-to-core; removed Pywikibot, Pywikibot-Scripts.

jayvdb set Security to None.

jayvdb removed a subscriber: Unknown Object (????).

I think it would still be useful to have a script that batch uploads the OCR text for an entire work using the preload functionality added by @Mpaa.
But I would like to hear the opinions of more active All-and-every-Wikisource contributors on that.
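A batch script of that kind would first need the list of target pages. The sketch below only generates the titles, assuming the usual `Page:<file>/<number>` naming and an English namespace name; the function name and its parameters are hypothetical, and other Wikisources localise the namespace.

```python
def page_titles(djvu_name, first, last, namespace="Page"):
    """Return the Page-namespace titles for pages first..last (inclusive).

    The 'Page' namespace name is an assumption that holds on en.WS;
    pass a different value for wikis that localise it.
    """
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    return [
        "{}:{}/{}".format(namespace, djvu_name, number)
        for number in range(first, last + 1)
    ]


titles = page_titles("Example.djvu", 1, 3)
```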

jayvdb moved this task from Backlog to Scripts being investigated on the Pywikibot-compat-to-core board. Dec 5 2014, 4:28 AM

Ricordisamoa added a project: All-and-every-Wikisource. Apr 7 2015, 9:14 AM

Nemo_bis subscribed. Apr 7 2015, 10:50 AM

Actually, one possible use case is when Pages have been created from an old version of a djvu file, then a new djvu with an improved text layer becomes available and the Page content needs to be overwritten. Maybe quite a remote use case ...

...or not so remote if T34695 ever approaches reality. In general, the issue of syncing Wikisource <-> their DjVu source <-> the source digital library is IMHO the biggest open question in the Wikisource model.

This is more than enWS, and more than a policy at one site.

I believe that there is still a place for the script.

The script should be considered more versatile than English Wikisource alone. If the other WSes (or indeed other external sites) wish to utilise the tool, then go for it.
If enWS received a perfect djvu book that hadn't relied on OCR, then why wouldn't we want to have a bot do all the work?

I cannot speak for the internals of the Proofread application or pywiki, so you will need to determine that level of complexity.

In T66853#1187109, @Mpaa wrote:

Actually, one possible use case is when Pages have been created from an old version of a djvu file, then a new djvu with an improved text layer becomes available and the Page content needs to be overwritten. Maybe quite a remote use case ...

That's bordering on laughable. In my early WS days, not knowing any better, I myself blindly bot-created entire thousand-page-plus Index:es of generally worthless OCR'd crap, and I can attest that most everyone with any time under their belt has done the same at some point or another. Skipping the deletion step currently needed to replace inferior OCR text after a source file replacement would be useful regardless of how many people actually go back and do something about these 'poor decision' cases.

Change 210808 had a related patch set uploaded (by Mpaa):
Added DjVu class and djvutext.py in core

https://gerrit.wikimedia.org/r/210808

gerritbot added a project: Patch-For-Review. May 14 2015, 10:55 PM

Change 210808 merged by jenkins-bot:
Added DjVuFile class and djvutext.py in core

https://gerrit.wikimedia.org/r/210808

Mpaa mentioned this in rPWBCf76a2403b7dd: Added DjVuFile class and djvutext.py in core. Jun 11 2015, 10:23 PM

jayvdb closed this task as Resolved. Jun 11 2015, 11:53 PM

jayvdb assigned this task to Mpaa.

Change 224199 had a related patch set uploaded (by John Vandenberg):
Add djvulibre-bin to travis apt package list

https://gerrit.wikimedia.org/r/224199

Change 224199 merged by jenkins-bot:
Add djvulibre-bin to travis apt package list

https://gerrit.wikimedia.org/r/224199