harvest microformats
Open, Needs Triage, Public

Assigned To

None

Authored By

	jayvdb
	Dec 12 2014, 7:18 PM

Description

Some data in Wikipedia is easier to extract from the rendered HTML than from the templates, because the rendered page puts the values into microformats. Other webpages that use microformats could also be used to extract information and add it to Wikidata. I expect this should be done in a new script, but it would be based on the existing script harvest_template.py.

https://en.wikipedia.org/wiki/Help:Microformats .

Birth date and death date are good examples: on English Wikipedia they are placed in special spans, using a consistent format.

view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin

The {{Persondata}} template is relatively easy to parse as wikitext, but its data is also well labelled in the HTML. https://en.wikipedia.org/wiki/Wikipedia:Persondata

<table id="persondata" class="persondata noprint" style="border:1px solid #aaa; display:none; speak:none;">
<tr>
<th colspan="2"><a href="/wiki/Wikipedia:Persondata" title="Wikipedia:Persondata">Persondata</a></th>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Name</td>
<td>Franklin, Benjamin</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Alternative names</td>
<td></td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Short description</td>
<td>American printer, writer, politician</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Date of birth</td>
<td>January 17, 1706</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Place of birth</td>
<td>Boston, Massachusetts</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Date of death</td>
<td>April 17, 1790</td>
</tr>
<tr>
<td class="persondata-label" style="color:#aaa;">Place of death</td>
<td><a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>, Pennsylvania</td>
</tr>
</table>
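
As a rough illustration of how well labelled this table is, here is a minimal sketch that reads the label/value pairs using only the Python standard library. The names PersondataParser and parse_persondata are made up for this example and are not part of Pywikibot; the sketch assumes the flat, non-nested table layout shown above.

```python
from html.parser import HTMLParser


class PersondataParser(HTMLParser):
    """Collect label/value pairs from a table with id="persondata"."""

    def __init__(self):
        super().__init__()
        self.fields = {}         # label text -> value text
        self._in_table = False   # inside <table id="persondata">
        self._in_cell = False    # inside a <td> of that table
        self._is_label = False   # current <td> has class persondata-label
        self._label = None       # last label seen, waiting for its value
        self._text = []          # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == "persondata":
            self._in_table = True
        elif self._in_table and tag == "td":
            classes = (attrs.get("class") or "").split()
            self._in_cell = True
            self._is_label = "persondata-label" in classes
            self._text = []

    def handle_data(self, data):
        if self._in_cell:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            # Assumption: no nested tables inside the persondata table.
            self._in_table = False
        elif tag == "td" and self._in_cell:
            text = "".join(self._text).strip()
            if self._is_label:
                self._label = text
            elif self._label is not None:
                self.fields[self._label] = text
                self._label = None
            self._in_cell = False


def parse_persondata(html):
    """Return a dict of persondata labels to their text values."""
    parser = PersondataParser()
    parser.feed(html)
    return parser.fields
```

Link markup inside a value cell is flattened to plain text, so the "Place of death" row above would come out as "Philadelphia, Pennsylvania".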

More at https://en.wikipedia.org/wiki/Wikipedia:Metadata

A list of templates which generate microformats is at https://en.wikipedia.org/wiki/Category:Templates_generating_microformats , and sample pages can be found by using 'whatlinkshere'.

For example, a vcard with the classes fn and org can be seen in the source of the infobox here:

view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal

Event Timeline

jayvdb created this task. Dec 12 2014, 7:18 PM

jayvdb raised the priority of this task from to Needs Triage.

jayvdb updated the task description.

jayvdb added a project: Pywikibot-Wikidata.

jayvdb changed Security from none to None.

jayvdb subscribed.

This has been proposed as a GCI task: https://www.google-melange.com/gci/task/view/google/gci2014/5857599308169216

The parsed version of a wiki page can be obtained using the API's parse module:
https://en.wikipedia.org/w/api.php?action=parse&page=Benjamin_Franklin
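
A small sketch of how that request could be built and its JSON response unpacked, using only the standard library. The helper names build_parse_url and html_from_response are hypothetical. The prop=text and format=json parameters are additions to the URL above, and the sketch assumes the default JSON layout, where the rendered HTML sits under parse → text → "*". The actual HTTP request is left out.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_parse_url(title, endpoint=API_ENDPOINT):
    """Build an action=parse request URL for the given page title."""
    params = {
        "action": "parse",
        "page": title,
        "prop": "text",      # only ask for the rendered HTML
        "format": "json",
    }
    return endpoint + "?" + urlencode(params)


def html_from_response(body):
    """Extract the rendered HTML from a JSON response body (a str)."""
    data = json.loads(body)
    # Assumption: default JSON format, HTML stored under the "*" key.
    return data["parse"]["text"]["*"]
```

In a Pywikibot script the request would more likely go through the framework's own API layer than through a hand-built URL.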

jayvdb added a project: Google-Code-in-2014. Dec 20 2014, 4:30 AM

jayvdb moved this task from Backlog to pywikibot on the Google-Code-in-2014 board.

Copying the GCI task description:

This task is to create a new script to harvest data from HTML microformats in Wikipedia pages and other webpages, and add the data to items in Wikidata. The new script will be similar in nature to the existing script harvest_template.py, except it will use HTML instead of wikitext, and it can offer automatic assignments of values to properties where the microformats describe the data in a standardised way that maps to properties on Wikidata.

I am thinking of implementing it in the following way: find all pages which link to a given template, get the HTML for each page, look for a table with id="template_name" inside the HTML, parse the key-values in the table and add them to Wikibase.

Did I get it right?

In T78416#938277, @murfel wrote:

I am thinking of implementing it in the following way: find all pages which link to a given template, get the HTML for each page, look for a table with id="template_name" inside the HTML, parse the key-values in the table and add them to Wikibase.

Did I get it right?

Maybe, but maybe not. My inclusion of {{Persondata}} as an example was perhaps misleading.

This harvest_microformats script should not be based on templates, as that is the job of harvest_template.py.

This script will use pagegenerators as arguments to select which pages should be processed, and -page:"..." is the easiest to use for testing.

For each page, get the HTML as you've said, and look for microformats (http://microformats.org/) in the HTML. Microformats are usually described using HTML class="..." attributes, such as:

view-source:https://en.wikipedia.org/wiki/Benjamin_Franklin

and

view-source:https://en.wikipedia.org/wiki/Manchester_Ship_Canal

<th colspan="2" class="fn org" style="text-align:center;font-size:125%;font-weight:bold;font-size: larger; background-color: #CEDEFF">Manchester Ship Canal</th>
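
To make the class-based lookup concrete, here is a minimal standard-library sketch that collects the text of every element carrying one of a set of class names. ClassTextCollector and collect_classes are names invented for this example. The sketch assumes reasonably well-nested HTML (an unclosed <p> or <li> would throw off the depth tracking), which is one reason a real parser library is the better choice.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextCollector(HTMLParser):
    """Collect (class_name, text) pairs for elements with wanted classes."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = []       # (class_name, text) in closing order
        self._depth = 0       # number of currently open elements
        self._captures = []   # (matched names, depth at open, text parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self._depth += 1
        classes = (dict(attrs).get("class") or "").split()
        names = [c for c in classes if c in self.wanted]
        if names:
            self._captures.append((names, self._depth, []))

    def handle_data(self, data):
        # Text belongs to every open element that is being captured.
        for _, _, parts in self._captures:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._captures and self._captures[-1][1] == self._depth:
            names, _, parts = self._captures.pop()
            text = "".join(parts).strip()
            self.found.extend((name, text) for name in names)
        self._depth -= 1


def collect_classes(html, wanted):
    """Return a list of (class_name, text) pairs found in the HTML."""
    collector = ClassTextCollector(wanted)
    collector.feed(html)
    return collector.found
```

Run against the <th> element above with wanted classes fn and org, this would report "Manchester Ship Canal" once for each of the two classes.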

The two most important standardised microformats are
http://microformats.org/wiki/hCard
http://microformats.org/wiki/hCalendar

Another microformat that is very relevant to wikis is http://microformats.org/wiki/rel-license

However, Wikimedia mostly uses its own non-standard microformats. For example, "licensetpl" is used by Wikisource and Wikimedia Commons instead of rel-license:

view-source:https://en.wikisource.org/wiki/The_Clipper_Ship_Era

<table class="licensetpl" style="display:none;">
<tr>
<td><span class="licensetpl_short">Public domain</span><span class="licensetpl_long">Public domain</span><span class="licensetpl_link_req">false</span><span class="licensetpl_attr_req">false</span></td>
</tr>
</table>
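
Since "licensetpl" is not a standard microformat, a generic microformats parser is unlikely to pick it up, and the script would need its own handling. A minimal sketch, assuming the flat span layout shown above (no spans nested inside the licensetpl_* spans); LicenseTplParser and parse_licensetpl are hypothetical names, and turning the *_req values into booleans is a choice made for this example only.

```python
from html.parser import HTMLParser


class LicenseTplParser(HTMLParser):
    """Collect the text of <span class="licensetpl_..."> elements."""

    PREFIX = "licensetpl_"

    def __init__(self):
        super().__init__()
        self.fields = {}    # e.g. "short" -> "Public domain"
        self._key = None    # field currently being read, if any
        self._text = []     # text fragments of the current span

    def handle_starttag(self, tag, attrs):
        if tag != "span" or self._key is not None:
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls.startswith(self.PREFIX):
                self._key = cls[len(self.PREFIX):]
                self._text = []
                break

    def handle_data(self, data):
        if self._key is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._key is not None:
            self.fields[self._key] = "".join(self._text).strip()
            self._key = None


def parse_licensetpl(html):
    """Return licensetpl fields, with *_req values as booleans."""
    parser = LicenseTplParser()
    parser.feed(html)
    fields = dict(parser.fields)
    for key, value in fields.items():
        if key.endswith("_req") and value in ("true", "false"):
            fields[key] = (value == "true")
    return fields
```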

When microformats have been found in the HTML, yes: "parse key-values [from the microformat] and add them to Wikibase". But there are Python libraries that already do most of the grunt work for you, so hopefully you don't need to do the parsing yourself; e.g. see http://microformats.org/wiki/parsers and search https://pypi.python.org/pypi/ . One library mentioned is https://github.com/tommorris/mf2py , which is maintained by @tommorris , an English Wikipedia admin among other things.
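
Once a parser has done the grunt work, what remains is mapping microformat properties to Wikidata properties. The sketch below assumes the parser returns the microformats2 JSON shape (a dict with an "items" list, each item having "type", "properties" and optionally "children"), which is what mf2py produces. PROPERTY_MAP and candidate_claims are hypothetical names, and the two mappings (bday to P569 date of birth, url to P856 official website) are only illustrative; writing the values to Wikidata is left out.

```python
# Illustrative mapping from (microformat type, property) to a Wikidata
# property ID. A real script would need a larger, reviewed table.
PROPERTY_MAP = {
    ("h-card", "bday"): "P569",   # date of birth
    ("h-card", "url"): "P856",    # official website
}


def candidate_claims(parsed):
    """Return (wikidata_property, value) pairs from mf2-style JSON."""
    claims = []

    def walk(item):
        props = item.get("properties", {})
        for mf_type in item.get("type", []):
            for name, values in props.items():
                pid = PROPERTY_MAP.get((mf_type, name))
                if pid is None:
                    continue
                # Only plain string values are taken as candidates here.
                claims.extend((pid, v) for v in values if isinstance(v, str))
        # Property values may themselves be embedded microformats.
        for values in props.values():
            for value in values:
                if isinstance(value, dict):
                    walk(value)
        for child in item.get("children", []):
            walk(child)

    for item in parsed.get("items", []):
        walk(item)
    return claims
```

For a hand-written input such as {"items": [{"type": ["h-card"], "properties": {"name": ["Benjamin Franklin"], "bday": ["1706-01-17"]}}]}, the function returns a single pair for P569; the name is skipped because it has no entry in the mapping.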

A useful tool to see microformats in any webpage (courtesy of @murfel on IRC)
https://mf2py.herokuapp.com/parse?url=https://en.wikipedia.org/wiki/Benjamin_Franklin

The version packaged on PyPI (0.2.1) doesn't handle Wikipedia microformats. You will need to install the latest version from GitHub: https://github.com/tommorris/mf2py

You will probably need bs4 4.2.1 (depends on python >= 2.7.5-5~) for mf2py to work correctly.

You can start with or take a look at my code: http://pastebin.com/acLfCrfD

Prtksxna subscribed. Jan 5 2015, 7:37 AM

Edgars2007 subscribed. Jul 12 2016, 9:45 AM

Phabricator_maintenance added a project: Pywikibot. Mar 23 2019, 10:15 PM

Restricted Application added a subscriber: Liuxinyu970226. Mar 23 2019, 10:15 PM
