Add scraping of Eprints tags to html-metadata node library
Closed, Resolved (Public)

Description

We use the Node.js html-metadata library to scrape metadata from web pages. This metadata is then used by citoid to generate a citation for the page, which editors can insert as a reference when writing or editing a Wikipedia article.

For this task you will need to install Node.js and npm: https://www.joyent.com/blog/installing-node-and-npm/

You will also need to fork the html-metadata library on GitHub: https://help.github.com/articles/fork-a-repo/

Eprints tags are a type of metadata embedded in a web page that describes how to cite the resource. There is a good explanation of them here: https://www.google.com/intl/en/scholar/inclusion.html#indexing

Here is an example of a page with eprints metadata: view-source:http://eprints.rclis.org/13306/

An example of an eprints tag from that page:

<meta name="eprints.title" content="Die Informationsfachpersonen in einem digitalisierten Umfeld" />

You should create a function called exports.parseEprints in https://github.com/wikimedia/html-metadata/blob/master/lib/index.js that scrapes all the eprints tags and creates a JavaScript object. This object should use the tag names as keys (excluding the eprints. prefix) and the content of each tag as the value. If a tag name occurs more than once, its values should be collected in an Array. You should also register the method in https://github.com/wikimedia/html-metadata/blob/master/index.js
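To make the expected output shape concrete, here is a minimal sketch of the key/value logic only. It is not the library's implementation: collectEprints is a hypothetical helper, and it assumes the meta tags have already been extracted from the page into an array of { name, content } objects. How exports.parseEprints receives the page, and what it should return to its caller, should follow the existing functions in lib/index.js.

```javascript
'use strict';

// Hypothetical helper (not part of html-metadata): builds the result object
// from meta tags that were already extracted from the page.
// Assumption: `metaTags` is an array of { name, content } objects.
function collectEprints(metaTags) {
  const prefix = 'eprints.';
  const result = {};

  for (const { name, content } of metaTags) {
    // Skip anything that is not an eprints tag or has no content.
    if (typeof name !== 'string' || !name.startsWith(prefix)) continue;
    if (content === undefined) continue;

    // Drop the "eprints." prefix to get the key.
    const key = name.slice(prefix.length);
    if (!key) continue;

    if (!Object.prototype.hasOwnProperty.call(result, key)) {
      // First occurrence: store the value directly.
      result[key] = content;
    } else if (Array.isArray(result[key])) {
      // Third or later occurrence: append to the existing Array.
      result[key].push(content);
    } else {
      // Second occurrence: convert the single value into an Array.
      result[key] = [result[key], content];
    }
  }

  return result;
}

// Usage with the title tag shown above plus placeholder values for a
// repeated tag (the creators_name values are made up for illustration).
const example = collectEprints([
  { name: 'eprints.title', content: 'Die Informationsfachpersonen in einem digitalisierten Umfeld' },
  { name: 'eprints.creators_name', content: 'Placeholder, A.' },
  { name: 'eprints.creators_name', content: 'Placeholder, B.' },
  { name: 'description', content: 'ignored: not an eprints tag' }
]);
console.log(example);
```

In this sketch a tag that appears once stays a plain string, and only repeated tags become an Array, which matches the requirement above.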

You should also create tests for your new function. You can put tests against a live website in https://github.com/wikimedia/html-metadata/blob/master/test/scraping.js. You can also create tests against a static page in https://github.com/wikimedia/html-metadata/blob/master/test/static.js. You may find it helpful to write these tests early on, as they will help you verify that your function works correctly. Run the tests with:

npm test