
Export Wikisource texts in ALTO-XML format
Open, Needs Triage · Public

Description

During a recent collaboration with the British Library (BL), BL asked whether Wikisource texts could be exported in ALTO XML format, or in any other XML format that is convenient.

Event Timeline


ALTO XML seems to be an XML format designed for OCR output. It encodes the text positioning data that we do not keep in Wikitext. It's closer to the DjVu OCR format.
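To make the point about positioning data concrete, here is a minimal sketch of the kind of per-word information an ALTO file carries. The element and attribute names (`TextLine`, `String`, `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`) follow the ALTO schema as I understand it; the helper `words_to_alto_line` and the sample numbers are made up for illustration, and the namespace and the header elements a valid ALTO file needs are left out.

```python
import xml.etree.ElementTree as ET


def words_to_alto_line(words):
    """Build a bare-bones ALTO-like fragment for one line of text.

    words: list of (text, hpos, vpos, width, height) tuples.
    Hypothetical helper; the output is NOT a schema-valid ALTO document
    (no namespace, no Description/Styles sections).
    """
    alto = ET.Element("alto")
    layout = ET.SubElement(alto, "Layout")
    page = ET.SubElement(layout, "Page")
    print_space = ET.SubElement(page, "PrintSpace")
    block = ET.SubElement(print_space, "TextBlock")
    line = ET.SubElement(block, "TextLine")
    for text, hpos, vpos, width, height in words:
        # Each word carries its own bounding box; this is exactly the
        # data that wikitext does not store.
        ET.SubElement(line, "String", {
            "CONTENT": text,
            "HPOS": str(hpos),
            "VPOS": str(vpos),
            "WIDTH": str(width),
            "HEIGHT": str(height),
        })
    return ET.tostring(alto, encoding="unicode")


if __name__ == "__main__":
    # Illustrative coordinates only.
    print(words_to_alto_line([("Hello", 100, 200, 80, 30),
                              ("world", 190, 200, 85, 30)]))
```

Every `String` needs a bounding box, so an export from wikitext alone would have nothing to put in those attributes.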

Dumb question about "any other XML format": would HTML/XHTML work?

Hi, thanks for considering this! I raised this initially with Bohisattwa. I think it would need to be XML or hOCR. My idea was that if transcriptions could be exported as XML, we could ingest them into our library system so that the books become searchable through our IIIF viewer: with word coordinates in the XML, search terms would be highlighted in our image viewer.

@Tderrick Sadly, the Wikisource transcription system does not keep word coordinates. It might be possible to match the Wikisource transcription against OCR output that does have coordinates, and use the transcription to correct the OCR text, but that is not an easy task at all.
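As a rough illustration of that matching idea, the sketch below aligns OCR words (which have coordinates) with transcription words (which do not) using Python's standard `difflib.SequenceMatcher`, and copies a box across wherever the two sequences line up one to one. The function name `transfer_coordinates` and the sample data are hypothetical, and this is a toy under strong assumptions: it ignores hyphenation, reading order, split or merged words, and anything else that makes the real problem hard.

```python
from difflib import SequenceMatcher


def transfer_coordinates(ocr_words, transcription):
    """Attach OCR bounding boxes to proofread transcription words.

    ocr_words: list of (text, box) pairs from an OCR engine.
    transcription: list of corrected words (no coordinates).
    Returns a list of (word, box_or_None) in transcription order.
    """
    ocr_texts = [text.lower() for text, _ in ocr_words]
    trans_texts = [word.lower() for word in transcription]
    matcher = SequenceMatcher(a=ocr_texts, b=trans_texts, autojunk=False)

    result = [(word, None) for word in transcription]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # "equal": words agree. Same-length "replace": assume each OCR
        # word is a misread of the transcription word at that position.
        if tag == "equal" or (tag == "replace" and same_length):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                result[j] = (transcription[j], ocr_words[i][1])
        # Inserted or unevenly replaced words keep box None.
    return result


if __name__ == "__main__":
    ocr = [("Tbe", (0, 0, 10, 5)), ("quick", (12, 0, 20, 5)),
           ("fox", (34, 0, 12, 5))]
    for word, box in transfer_coordinates(ocr, ["The", "quick", "brown", "fox"]):
        print(word, box)
```

Words that the OCR missed entirely (here "brown") end up without a box, which is one reason the result could not simply be exported as complete ALTO or hOCR without further work.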