harvest microformats
Open, Needs Triage, Public

Description

Some data in Wikipedia is easier to extract from the rendered HTML than from the templates, because the HTML marks up the values as microformats. Other webpages also use microformats, and these could be used to extract information and add it to Wikidata. I expect this should be done in a new script, but it would be based on the script harvest_template.py

https://en.wikipedia.org/wiki/Help:Microformats .

Birth date and death date are good examples: on English Wikipedia they are placed in special spans, using a constant format.

view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin

<span class="bday">1706-01-17</span>
<span class="dday deathdate">1790-04-17</span>

The {{Persondata}} template is relatively easy to parse from the wikitext, but its data is also well labelled in the HTML. https://en.wikipedia.org/wiki/Wikipedia:Persondata

<table id="persondata" class="persondata noprint" style="border:1px solid #aaa; display:none; speak:none;">
<tr>
<th colspan="2"><a href="/wiki/Wikipedia:Persondata" title="Wikipedia:Persondata">Persondata</a></th>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Name</td>
<td>Franklin, Benjamin</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Alternative names</td>
<td></td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Short description</td>
<td>American printer, writer, politician</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Date of birth</td>
<td>January 17, 1706</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Place of birth</td>
<td>Boston, Massachusetts</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Date of death</td>
<td>April 17, 1790</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Place of death</td>
<td><a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>, Pennsylvania</td>
</tr>
</table>
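
As a rough illustration of how the labelled table above could be read, here is a minimal sketch using only the Python standard library. The names PersondataParser and parse_persondata are hypothetical and not part of Pywikibot; the sketch assumes each row holds a td with class "persondata-label" followed by one value td, as in the sample.

```python
from html.parser import HTMLParser


class PersondataParser(HTMLParser):
    """Collect label/value pairs from a table with id="persondata"."""

    def __init__(self):
        super().__init__()
        self.in_table = False       # inside <table id="persondata">
        self.in_cell = False        # inside a <td> of that table
        self.cell_is_label = False  # current <td> has class persondata-label
        self.text = []              # text fragments of the current cell
        self.label = None           # most recent label still awaiting a value
        self.data = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == "persondata":
            self.in_table = True
        elif self.in_table and tag == "td":
            self.in_cell = True
            classes = (attrs.get("class") or "").split()
            self.cell_is_label = "persondata-label" in classes
            self.text = []

    def handle_data(self, data):
        if self.in_cell:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td" and self.in_cell:
            value = "".join(self.text).strip()
            if self.cell_is_label:
                self.label = value
            elif self.label is not None:
                self.data[self.label] = value
                self.label = None
            self.in_cell = False


def parse_persondata(html):
    """Return a dict mapping Persondata labels to their text values."""
    parser = PersondataParser()
    parser.feed(html)
    parser.close()
    return parser.data
```

Text inside nested links is kept, so the "Place of death" cell in the sample would come back as plain text including the linked city name. Empty cells come back as empty strings.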

More at https://en.wikipedia.org/wiki/Wikipedia:Metadata

A list of templates which generate microformats is at https://en.wikipedia.org/wiki/Category:Templates_generating_microformats , and sample pages can be found by using 'whatlinkshere'.

e.g. vcard with fn org can be seen in the source of the infobox here:

view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal

Event Timeline

jayvdb raised the priority of this task from to Needs Triage.
jayvdb updated the task description.
jayvdb added a project: Pywikibot-Wikidata.
jayvdb changed Security from none to None.
jayvdb subscribed.

Copying the GCI task description:

This task is to create a new script to harvest data from HTML microformats in Wikipedia pages and other webpages, and add the data to items in Wikidata. The new script will be similar in nature to the existing script harvest_template.py, except it will use HTML instead of wikitext, and it can offer automatic assignments of values to properties where the microformats describe the data in a standardised way that maps to properties on Wikidata.
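
To make the idea of "automatic assignments of values to properties" a little more concrete, here is a small sketch of such a mapping. The table CLASS_TO_PROPERTY and the function to_claims are hypothetical names for illustration; P569 (date of birth) and P570 (date of death) are the Wikidata properties, and the sketch assumes the microformat values are ISO dates like the bday and dday spans in the task description.

```python
from datetime import date

# Hypothetical mapping from microformat class names to Wikidata property IDs.
# P569 is "date of birth" and P570 is "date of death" on Wikidata.
CLASS_TO_PROPERTY = {
    "bday": "P569",
    "dday": "P570",
}


def to_claims(values):
    """Turn {class_name: ISO date string} into (property_id, date) pairs.

    Class names without a known mapping are skipped, so they can be
    reviewed by hand instead of being guessed at.
    """
    claims = []
    for class_name, text in sorted(values.items()):
        prop = CLASS_TO_PROPERTY.get(class_name)
        if prop is None:
            continue
        claims.append((prop, date.fromisoformat(text)))
    return claims
```

A real script would still have to turn these pairs into Wikibase claims and check for existing statements before saving; that part is left out here.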

I plan to implement it as follows: collect all pages which link to a given template, get the HTML for each page, look for a table with id="template_name" inside the HTML, parse the key-value pairs in the table and add them to Wikibase.

Did I get it right?

Maybe, but maybe not. My inclusion of {{Persondata}} as an example was perhaps misleading.

This harvest_microformats script should not be based on templates, as that is the job of harvest_template.py .

This script will accept pagegenerators arguments to select which pages should be processed, and -page:"..." is the easiest to use for testing.

For each page, get the HTML as you've said, and look for microformats (http://microformats.org/) in the HTML. Microformats are usually described using HTML class="..." attributes, such as:

view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin

<span class="bday">1706-01-17</span>
<span class="dday deathdate">1790-04-17</span>

and

view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal

<th colspan="2" class="fn org" style="text-align:center;font-size:125%;font-weight:bold;font-size: larger; background-color: #CEDEFF">Manchester Ship Canal</th>
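
The class-based lookup described above can be sketched with the standard library alone. This is only an illustration, not how a microformats library works internally: ClassTextCollector and collect_by_class are hypothetical names, and the sketch assumes reasonably well-nested HTML. The class attribute is split into tokens, so "dday deathdate" matches dday and "fn org" matches both fn and org.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are not tracked on the stack.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a wanted class token."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []  # per open element: list of (class_name, fragments)
        self.found = {}  # class_name -> list of text values

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        tokens = (dict(attrs).get("class") or "").split()
        self.stack.append([(t, []) for t in tokens if t in self.wanted])

    def handle_data(self, data):
        # Text belongs to every enclosing element that is being captured.
        for captures in self.stack:
            for _, fragments in captures:
                fragments.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or not self.stack:
            return
        for name, fragments in self.stack.pop():
            self.found.setdefault(name, []).append("".join(fragments).strip())


def collect_by_class(html, wanted):
    """Return {class_name: [text, ...]} for the wanted class tokens."""
    parser = ClassTextCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.found
```

Calling collect_by_class(html, ["bday", "dday", "fn", "org"]) on the two snippets above would give the two ISO dates and the canal name. Real pages are messier than this, which is one reason to prefer an existing parser library where possible.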

The two most important standardised microformats are
http://microformats.org/wiki/hCard
http://microformats.org/wiki/hCalendar

Another microformat that is very relevant to wikis is http://microformats.org/wiki/rel-license

However, Wikimedia mostly uses its own non-standard microformats. For example, "licensetpl" is used by Wikisource and Wikimedia Commons instead of rel-license:

view-source:https://en.wikisource.org/wiki/The_Clipper_Ship_Era

<table class="licensetpl" style="display:none;">
<tr>
<td><span class="licensetpl_short">Public domain</span><span class="licensetpl_long">Public domain</span><span class="licensetpl_link_req">false</span><span class="licensetpl_attr_req">false</span></td>
</tr>
</table>
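
Because "licensetpl" is not a standard microformat, a generic parser may not pick it up, so it might need a small dedicated reader. The following is a sketch under that assumption: LicenseTplParser and parse_license are hypothetical names, the field names are simply whatever follows the "licensetpl_" prefix in the sample above, and treating link_req and attr_req as booleans is a guess based on their "true"/"false" text.

```python
from html.parser import HTMLParser

PREFIX = "licensetpl_"


class LicenseTplParser(HTMLParser):
    """Collect the text of elements whose class starts with licensetpl_."""

    def __init__(self):
        super().__init__()
        self.key = None   # field currently being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        for token in (dict(attrs).get("class") or "").split():
            if token.startswith(PREFIX):
                self.key = token[len(PREFIX):]
                self.fields[self.key] = ""

    def handle_data(self, data):
        if self.key is not None:
            self.fields[self.key] += data

    def handle_endtag(self, tag):
        # The sample spans contain no nested elements, so any end tag
        # closes the field being read.
        self.key = None


def parse_license(html):
    """Return the licensetpl fields, with the *_req flags as booleans."""
    parser = LicenseTplParser()
    parser.feed(html)
    parser.close()
    fields = {key: value.strip() for key, value in parser.fields.items()}
    for flag in ("link_req", "attr_req"):
        if flag in fields:
            fields[flag] = fields[flag] == "true"
    return fields
```

The outer table has class "licensetpl" without the trailing underscore, so it is not treated as a field itself.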

Once microformats have been found in the HTML, yes: "parse key-values [from the microformat] and add them to Wikibase". But there are Python libraries that already do most of the grunt work for you, so hopefully you don't need to do the parsing yourself; e.g. see http://microformats.org/wiki/parsers and search https://pypi.python.org/pypi/ . One library mentioned is https://github.com/tommorris/mf2py , which is maintained by @tommorris , an English Wikipedia admin among other things.

The version packaged on PyPI (0.2.1) doesn't handle Wikipedia microformats. You will need to install the latest from GitHub: https://github.com/tommorris/mf2py

You will probably need bs4 4.2.1 (which depends on python >= 2.7.5-5~) for mf2py to work correctly.

You can start with or take a look at my code: http://pastebin.com/acLfCrfD