Might be non-trivial due to dependency on a djvu reading program ('djvused'), part of djvulibre-bin
Version: core-(2.0)
Severity: enhancement
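To make the dependency concrete, here is a minimal sketch of how a script might shell out to 'djvused' for the page count and for the text layer of one page. The function names `djvused_command` and `run_djvused` are hypothetical and are not taken from compat's djvutext.py. The sketch assumes djvulibre-bin is installed and on the PATH.

```python
import subprocess


def djvused_command(djvu_path, page=None):
    """Build the djvused argument list (hypothetical helper).

    With page=None the djvused script is 'n', which prints the page count.
    Otherwise it selects one page and prints its text layer as plain text.
    """
    if page is None:
        script = "n"
    else:
        script = "select {}; print-pure-txt".format(int(page))
    return ["djvused", "-e", script, djvu_path]


def run_djvused(djvu_path, page=None):
    """Run djvused and return its stdout as text.

    Requires the external 'djvused' binary from djvulibre-bin; raises
    subprocess.CalledProcessError if djvused exits with an error.
    """
    result = subprocess.run(
        djvused_command(djvu_path, page),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

Keeping the command construction separate from the subprocess call means the argument list can be checked on a machine without djvulibre-bin, which is where the non-trivial part of the port (and of testing it) is likely to lie.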
Status | Subtype | Assigned | Task
---|---|---|---
Invalid | | None | T72936 Important tasks to be solved (tracking)
Open | Feature | None | T57880 Functionality existing in compat but missing from core (tracking)
Resolved | | Mpaa | T66853 Port djvutext.py to core
I do not think this is really needed. With support of "preload" (see https://bugzilla.wikimedia.org/show_bug.cgi?id=58963), if an Index Page is present on wikisource, one can get page text with the API even if a page is not created yet. This is what I usually do on en.WS.
Change 132816 had a related patch set uploaded by Mpaa:
Bug 64853 - Port djvutext.py to core
https://gerrit.wikimedia.org/r/132816 has been merged.
But we don't have djvutext.py yet.
As I said above, I cannot see a use case. If someone has a djvu and wishes to upload the text to wikisource, they will first of all create an Index linked to that djvu.
Once that is done, you can fetch the text with the patch above directly from the site.
No dependencies, no errors, and pagination is handled by the Proofread extension.
I think it would still be useful to have a script which batch-uploads the OCR text for the entire work using the preload functionality added by @Mpaa.
But I would like to hear the opinion of more active contributors, from any and every Wikisource, on that.
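As a rough illustration of what such a batch script would iterate over, the sketch below only generates the Page titles for a range of pages, assuming the usual "Page:<file name>/<page number>" title pattern and the English namespace name "Page" (it is localized on other Wikisources). The helper name `page_titles` is hypothetical. Fetching the preloaded text and saving it through pywikibot is deliberately left out.

```python
def page_titles(index_file, first, last, namespace="Page"):
    """Yield (page_number, title) for each page in the inclusive range.

    Hypothetical helper: 'namespace' is an assumption and differs per wiki.
    """
    for number in range(first, last + 1):
        yield number, "{}:{}/{}".format(namespace, index_file, number)


# Example: the first three pages of a work whose file is "Book.djvu".
for number, title in page_titles("Book.djvu", 1, 3):
    print(number, title)
```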
Actually, one possible use case is when Pages have been created from an old version of a djvu file, and then a new djvu with an improved text layer becomes available and the Page content needs to be overwritten. Maybe quite a remote use case ...
...or not so remote if T34695 ever approaches reality. In general, the issue of syncing Wikisource <-> their DjVu source <-> the source digital library is IMHO the biggest open question in the Wikisource model.
This is more than enWS, and more than a policy at one site.
I believe that there is still a place for the script.
I cannot speak for the internals of the Proofread application, or pywiki, so you will need to determine that level of complexity.
That's bordering on laughable. I myself have blindly bot created entire thousand page+ Index:es of generally worthless OCR'd crap not knowing any better in my early WS days and can attest most everyone with any time under their belt has done the same at some point or another too. I know skipping the current deletion step needed to replace inferior OCR's text after a source file replacement would be useful regardless of how many people actually go back and do something about these 'poor decision' cases.
Change 210808 had a related patch set uploaded (by Mpaa):
Added DjVu class and djvutext.py in core
Change 224199 had a related patch set uploaded (by John Vandenberg):
Add djvulibre-bin to travis apt package list