To add some basic djvulibre calls to API
Open, Low, Public

Description

DjVu files, the main image+text multipage format used by Wikisource projects, have a fairly robust text layer, but MediaWiki can't fully access it (except for rough extraction of plain text). New API actions, added to both the Commons API and the Wikisource projects' APIs, would allow retrieval of more complex text layers and/or interesting [meta]data from the whole file or from selected pages.

While read functions are safe, write functions can be destructive, even though they could be very useful to advanced users; so I think the first step should be to implement only read-only functions.

  • djvutext (to read structured text in lisp-like syntax) and
  • djvutoxml (to extract structured text in xml)

would be IMHO the first two routines to implement.
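
To make the read-only idea concrete, here is a rough sketch of how the two calls could be wrapped. It assumes the DjVuLibre command-line tools are installed; note that DjVuLibre ships the text extractor as `djvutxt`. The function names are made up for illustration, and the flags (`--page`, `--detail`, `--with-text`) reflect my reading of the man pages and should be checked against the installed version. This is not how MediaWiki does it or would have to do it.

```python
import subprocess


def djvutxt_command(path, page=None, detail=None):
    """Build the argv for djvutxt (hypothetical helper, read-only).

    With --detail (e.g. "word" or "line") djvutxt is expected to emit
    the text layer as lisp-like s-expressions instead of plain text.
    """
    cmd = ["djvutxt"]
    if page is not None:
        cmd.append(f"--page={page}")
    if detail is not None:
        cmd.append(f"--detail={detail}")
    cmd.append(path)
    return cmd


def djvutoxml_command(path, page=None):
    """Build the argv for djvutoxml (hypothetical helper, read-only)."""
    cmd = ["djvutoxml", "--with-text"]
    if page is not None:
        cmd += ["--page", str(page)]
    cmd.append(path)
    return cmd


def run_readonly(cmd):
    """Run one of the commands above and return its stdout as text.

    Neither tool modifies the input file; a non-zero exit status
    raises subprocess.CalledProcessError.
    """
    result = subprocess.run(cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

For example, `run_readonly(djvutxt_command("book.djvu", page=3, detail="word"))` would ask for the word-level text layer of page 3 only, which matches the "selected pages" use case above.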


Version: unspecified
Severity: enhancement

Details

Reference
bz57351
bzimport raised the priority of this task from to Low.
bzimport set Reference to bz57351.
bzimport added a subscriber: Unknown Object (MLST).
brion added a comment.Nov 21 2013, 4:29 PM

It would be best if possible to have a common interface for multiple file types; PDF also can embed text for instance.

What sort of data format are you envisioning, and what sort of uses?

See T59807 for one possibility re: djvutoxml

GOIII updated the task description.Dec 13 2014, 12:14 PM
GOIII added a project: MediaWiki-DjVu.
GOIII set Security to None.
jayvdb added a subscriber: jayvdb.Feb 10 2015, 8:50 PM

FYI - New version of DjVuLibre & related released 2/12/2015

http://sourceforge.net/projects/djvu/files/?source=navbar

jayvdb added a subscriber: Tpt.Feb 21 2015, 11:42 PM

Could this be written up as an Outreachy / #GSoC2015 project? @Tpt, would you be interested in helping as a mentor?

Tpt added a comment.Feb 22 2015, 4:57 PM

I'm interested in helping, but maybe not as primary mentor (I have no deep knowledge of DjVu or of the MediaWiki file system).

Mpaa added a subscriber: Mpaa.Apr 19 2015, 4:58 PM
Mpaa added a comment.Apr 21 2015, 6:28 PM

Until recently, the text layer was accessible through the API with prop=imageinfo and iiprop=metadata.
Now metadata returns null.

Compare https://commons.wikimedia.org/w/api.php?action=query&titles=File:Popular_Science_Monthly_Volume_20.djvu&prop=imageinfo&iiprop=metadata&rawcontinue with https://commons.wikimedia.org/w/api.php?action=query&titles=File:Popular_Science_Monthly_Volume_55.djvu&prop=imageinfo&iiprop=metadata&rawcontinue.

I support adding a way to access the text layer via the API or a similar method.
It would be useful, e.g., for Wikisource-related bots.
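
As a rough illustration of the comparison above, a bot could build the same imageinfo query and check whether metadata came back as null. This is only a sketch: the helper names are hypothetical, `format=json` is added for convenience, and the sample response in the usage note is hand-written to mirror the usual `query.pages.*.imageinfo` layout rather than copied from a real reply.

```python
from urllib.parse import urlencode

API = "https://commons.wikimedia.org/w/api.php"


def metadata_query_url(title):
    """Build the imageinfo/metadata query used in the comparison above."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "imageinfo",
        "iiprop": "metadata",
        "rawcontinue": "",
        "format": "json",  # assumption: JSON output is wanted
    }
    return API + "?" + urlencode(params)


def has_metadata(response):
    """Return True if any imageinfo entry carries non-null metadata."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            if info.get("metadata") is not None:
                return True
    return False
```

With a hand-written response such as `{"query": {"pages": {"1": {"imageinfo": [{"metadata": None}]}}}}`, `has_metadata` returns `False`, which is the "metadata returns null" case described above.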