Add scraping of BE Press tags to html-metadata node library
Closed, Resolved (Public)

Description

We use the Node.js html-metadata library to scrape metadata from webpages. citoid then uses this metadata to generate a citation for the webpage, which editors can insert as a reference when writing or editing a Wikipedia article.

To work on this task, you will need to install Node.js and npm: https://www.joyent.com/blog/installing-node-and-npm/.

You will also need to fork the html-metadata library on GitHub: https://help.github.com/articles/fork-a-repo/

BE Press tags are a type of metadata embedded in a webpage that describes how to cite the resource. There is a good explanation of them here: https://www.google.com/intl/en/scholar/inclusion.html#indexing

Here's an example of a page with BE Press metadata: view-source:http://uknowledge.uky.edu/upk_african_history/1/

Here's a sample tag from that page:

<meta name="bepress_citation_title" content="South Africa and the World: The Foreign Policy of Apartheid">

You should create a function called exports.parseBEPress in https://github.com/wikimedia/html-metadata/blob/master/lib/index.js that scrapes all the BE Press tags and builds a JavaScript object from them. The object's keys should be the tag names with the bepress_citation prefix removed, and each value should be the content of the corresponding tag. If the same tag name appears more than once, its values should be collected in an Array. You should also register the method in https://github.com/wikimedia/html-metadata/blob/master/index.js
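The key-building rules described above can be sketched roughly as follows. This is only an illustration, not the library's code: the helper name collectBEPress and its input shape (an array of name/content pairs standing in for the page's meta elements) are assumptions made so the sketch runs on its own, and the author names are placeholders. The real exports.parseBEPress would read the tags from the parsed page instead. The sketch also assumes the underscore after bepress_citation is dropped with the prefix, so bepress_citation_title becomes title.

```javascript
// Hypothetical helper: groups bepress_citation_* tags into a single object.
// `metaTags` is an assumed input shape, e.g. [{ name: '...', content: '...' }].
const PREFIX = 'bepress_citation_';

function collectBEPress(metaTags) {
  const result = {};
  for (const { name, content } of metaTags) {
    // Skip anything that is not a BE Press tag.
    if (typeof name !== 'string' || !name.startsWith(PREFIX)) {
      continue;
    }
    const key = name.slice(PREFIX.length);
    if (key === '' || content === undefined) {
      continue;
    }
    if (!Object.prototype.hasOwnProperty.call(result, key)) {
      result[key] = content;                 // first value: plain string
    } else if (Array.isArray(result[key])) {
      result[key].push(content);             // third and later values
    } else {
      result[key] = [result[key], content];  // second value: promote to Array
    }
  }
  return result;
}

// Usage with the sample title tag plus made-up author tags.
const example = collectBEPress([
  { name: 'bepress_citation_title', content: 'South Africa and the World: The Foreign Policy of Apartheid' },
  { name: 'bepress_citation_author', content: 'First Author' },   // placeholder value
  { name: 'bepress_citation_author', content: 'Second Author' },  // placeholder value
  { name: 'description', content: 'not a BE Press tag' }
]);
console.log(example);
```

A single value stays a string and only repeated tags become an Array, which matches the wording of the task. Whether consumers would prefer every value to be an Array is a design question this sketch does not settle.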

You should also create tests for your new function. Tests against a live website go in https://github.com/wikimedia/html-metadata/blob/master/test/scraping.js, and tests against a static page go in https://github.com/wikimedia/html-metadata/blob/master/test/static.js. You may find it helpful to write these tests early, as they will help you verify that your function works correctly. Run the tests with:

npm test