
Add an OAI-PMH API for index page
Closed, Resolved, Public


We need a good API to export the metadata of index pages in a standardized and interoperable format so that we can share it with the GLAM community. They use the OAI-PMH protocol. I think it would be a good idea to add this feature to ProofreadPage. I'm writing a first version of this feature (demo on labs: ).

To add this feature I've written a small metadata typing system. It's a very big addition to ProofreadPage, and it requires rewriting the configuration of index pages (bug 37419) on all Wikisources. I suggest testing this feature on one Wikisource for as long as stabilisation takes, then deploying it to all Wikisources once a special page that helps configure ProofreadPage is done (bug 37839).

Version: unspecified
Severity: enhancement



Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 12:49 AM
bzimport added a project: ProofreadPage.
bzimport set Reference to bz38498.

I'm familiar with OAI-PMH, but could you lay out a brief explanation of how you propose exposing data via this protocol and how to manage it?

(I'm the author of the existing OAI extension for MediaWiki, which we've used in the past for some whole-wiki mirrors and today mostly for updating our search indexes.)

The data are the content of the pages in the Index namespace. This namespace is managed by ProofreadPage and stores, in a template (MediaWiki:Proofreadpage_index_template), metadata about the scans that are being proofread (example: ). The OAI-PMH API exposes this metadata: the standard bibliographic data (title, author...) plus data related to proofreading (number of pages proofread in the book...). Sets could be built from the categories that group index pages.
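To make the harvesting side concrete, here is a minimal sketch of how a client might build requests against such an endpoint. The verbs (`ListSets`, `ListRecords`) and arguments (`metadataPrefix`, `set`) come from the OAI-PMH specification; the base URL and the set name `poetry` are placeholders invented for illustration, since the thread gives neither.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the real URL is not given in this thread.
BASE_URL = "https://wikisource.example/oai"


def build_request(verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    # Skip arguments that were not supplied.
    query.update({k: v for k, v in params.items() if v is not None})
    return BASE_URL + "?" + urlencode(query)


# List the available sets (here assumed to mirror index-page categories).
list_sets_url = build_request("ListSets")

# Harvest simple Dublin Core records from one (hypothetical) set.
list_records_url = build_request(
    "ListRecords", metadataPrefix="oai_dc", set="poetry"
)
```

A harvester would fetch these URLs and follow any `resumptionToken` in the response to page through large result lists.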

I've improved the configuration system of the index pages ([[MediaWiki:Proofreadpage data config]]) so that it can say: this entry in the index template corresponds to this item in a list of properties known by ProofreadPage (title, author, publisher, identifier...), has this type (string, page link, number, LCCN...), and has multiple values separated by these strings ('; ', ' and '...).

With this configuration ProofreadPage gets the values of the index template entries, splits them (using the strings listed in the config), checks that they match the declared type, and exposes them through the API. I've implemented simple Dublin Core, but I would also like to provide the data with a more efficient system.
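The split, type-check and expose steps described above could be sketched roughly as follows. This is only an illustration in Python, not the extension's actual code: the configuration layout, the field names, the validators and the `to_dublin_core` helper are all assumptions made for the example.

```python
import re

# Hypothetical configuration, loosely modelled on the description above:
# each index-template entry maps to a known property, a type, and the
# strings that separate multiple values.
CONFIG = {
    "Author": {"property": "creator", "type": "string",
               "delimiters": ["; ", " and "]},
    "Title": {"property": "title", "type": "string", "delimiters": []},
    "Year": {"property": "date", "type": "number", "delimiters": []},
}

# Very small stand-ins for the type checks (assumed, not the real ones).
VALIDATORS = {
    "string": lambda value: value != "",
    "number": lambda value: value.isdigit(),
}


def split_values(raw, delimiters):
    """Split a raw template value on any of the configured delimiters."""
    if delimiters:
        pattern = "|".join(re.escape(d) for d in delimiters)
        parts = re.split(pattern, raw)
    else:
        parts = [raw]
    return [part.strip() for part in parts if part.strip()]


def to_dublin_core(index_entries):
    """Map index-template entries to simple Dublin Core elements."""
    record = {}
    for name, raw in index_entries.items():
        entry = CONFIG.get(name)
        if entry is None:
            continue  # entry is not described in the configuration
        is_valid = VALIDATORS[entry["type"]]
        values = [v for v in split_values(raw, entry["delimiters"])
                  if is_valid(v)]
        if values:
            record.setdefault("dc:" + entry["property"], []).extend(values)
    return record


record = to_dublin_core({
    "Author": "Jane Doe; John Smith and Ann Lee",
    "Title": "Collected Poems",
    "Year": "c. 1890",          # fails the number check, so it is dropped
    "Notes": "not configured",  # unknown entry, ignored
})
```

In this sketch a value that fails its type check is silently dropped; a real implementation might instead report it so that the index page can be corrected.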

This would be a hugely useful addition. I'm currently employed by OCLC, which runs WorldCat [1]. If this OAI-PMH interface were to become active on Wikisource, WorldCat could start cataloguing Wikisource. This would be a great boon to both systems: WorldCat would become more useful by showing free full-text versions, and Wikisource could tap into the large traffic that WorldCat pulls. WorldCat already supports harvesting via OAI-PMH, so this would be very low-hanging fruit.

(In reply to comment #4)

Patch uploaded:

Let me know when this is live, Tpt, and I will start harvesting.

(In reply to comment #6)

(In reply to comment #4)

Patch uploaded:

Status Merged

I've copied the available information to [[mw:Extension:Proofread_Page#OAI-PMH]]; could someone please add more?

(In reply to comment #5)

(In reply to comment #4)

Patch uploaded:

Let me know when this is live, Tpt, and I will start harvesting.

It will presumably go live for Wikisources with wmf4 on Wednesday, November 28 (let us know what's missing for this bug report or additional ones to be considered closed).

(In reply to comment #8)

(let us know what's missing for this bug report or additional ones to be considered closed).

Tpt: Could you answer this? Or can this report be closed as FIXED?

Yes, all the most important features are now deployed and work fine. I'm closing this as FIXED.

(For archive happiness: this was removed in 4932396cf4be)