
Use microformats on Wiktionary to improve term parsing
Closed, Declined · Public

Description

(task opened as a follow-up to a conversation with @GWicke at Wikimania)

There is an experimental definition term endpoint deployed which makes some assumptions about the HTML structure of the Wiktionary markup (cf. parseDefinition.js).

To avoid future maintenance problems and to scale up the extraction to other languages and Wiktionaries, it would be helpful to have the Wiktionary templates already include semantic information in their output. On Wikisource this has already been done successfully (in the context of ebook metadata export).

It would also be a good starting point for future Wikidata <–> Wiktionary integration work. There is already some minimal semantic information included in a few places, for example gender markers (cf. the Spanish entry casa):

<span class="gender">
  <abbr title="feminine gender">f</abbr>
</span>
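
Even this small bit of markup is already machine-consumable without relying on page structure. Below is a minimal sketch of extracting it, assuming requests and BeautifulSoup, and assuming the span/abbr structure shown above (which may not be present on every entry):

import requests
from bs4 import BeautifulSoup

# Minimal sketch: pull gender markers out of the rendered HTML of an entry.
# Assumes the <span class="gender"><abbr title="..."> structure shown above.
html = requests.get('https://en.wiktionary.org/wiki/casa').text
soup = BeautifulSoup(html, 'html.parser')

for abbr in soup.select('span.gender abbr[title]'):
    # e.g. "f" -> "feminine gender"
    print(abbr.get_text(strip=True), '->', abbr['title'])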

A first step would be to formalize these conventions and implement them consistently on the English Wiktionary, then adapt the parsing code on the API side.

Thoughts?

Event Timeline

As I wrote by email a few minutes ago, I really like the idea of Wiktionary including more semantic markup. Without it, of course, efforts like the current content service endpoint are necessarily pretty brittle and not at all scalable to other languages. Also, I wrote the existing endpoint to parse just the information needed to match the product spec created by the design team (see T114949), but of course being able to expose the entire page content in a structured way would be much more useful.

I'm going to tag the Wikipedia app as well for tracking purposes. User engagement with the Wiktionary definition popup feature has been limited so far, but we're interested in seeing how we can increase it, and along with improving discoverability, expanding it beyond English would surely help.

@Jhernandez Are Wiktionary definition popups something the web team would be interested in, if we can get the endpoint into a more robust and scalable state?

@Mholloway I'm not really sure; why are you asking?

Sounds like a good idea to me, but I'm no PO :p

Fair 'nuff. :) Just thought of you because of your interest in node services. We don't really keep such a strict engineer/PO division of labor over here in Android-land.

@Mholloway, could you document which elements you would like to see marked up? I know this is somewhat implicit in the current extraction logic, but I think it would help move the discussion forward to have a concrete list that we could discuss with the community.

Sure, I'll do some thinking and put together a list.

It seems like adding microformats to identify
- language headers,
- part-of-speech headers,
- definitions, and
- examples
would be a good start.

Could we add class="wd-header", class="wd-part-of-speech" to the relevant header tags, and class="wd-definition" and class="wd-example" to the relevant <li> tags?

Edit: to incorporate @GWicke's suggestion below.
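
To make the proposal concrete, here is a rough sketch of how the API side could consume such classes. It assumes the proposed wd-* class names were actually present in the rendered HTML, which is not the case today; the selectors and the entry used are purely illustrative:

import requests
from bs4 import BeautifulSoup

# Illustrative only: assumes the rendered Wiktionary HTML carries the
# proposed wd-* classes (it does not yet).
html = requests.get('https://en.wiktionary.org/wiki/casa').text
soup = BeautifulSoup(html, 'html.parser')

languages = [h.get_text(strip=True) for h in soup.select('.wd-header')]
parts_of_speech = [h.get_text(strip=True) for h in soup.select('.wd-part-of-speech')]
definitions = [li.get_text(strip=True) for li in soup.select('li.wd-definition')]
examples = [li.get_text(strip=True) for li in soup.select('li.wd-example')]

print(languages, parts_of_speech, definitions, examples)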

@Mholloway: Those class names are very general, which makes conflicts more likely. Perhaps consider adding a prefix, such as "wd-" (wiktionary definition)?

Good idea. I updated my comment above to reflect it.

Niedzielski subscribed.

This kind of sounds like technical debt. Please drop the tag if I'm mistaken.

I finally managed to get some time to work on this and also did some research on microformats. In the last few years this area has become increasingly confusing, with a variety of options (microformats1/2, W3C microdata, schema.org, RDFa (Lite), JSON-LD, etc.).

In my opinion the simplest option in terms of marking up and parsing of content seems to be microformats2. They are based around prefix classes which can be mapped easily to JSON types with generic parsers available for different languages.

To test this out I added microformats2-compatible classes to usage examples rendered on Wiktionary (diff).

<div class="h-usage-example">
  <span class="e-example">This is an example</span>
  <span class="e-translation">This is the translation</span>
</div>

This allows one to write a simple extractor in a few lines of Python, with the help of the mf2py parser:

import mf2py

# Parse the rendered page; html5lib copes well with real-world HTML.
obj = mf2py.parse(url='https://en.wiktionary.org/wiki/fazer', html_parser='html5lib')

# Collect the text value of every e-example property on h-usage-example items.
examples = (example['value']
    for item in obj.get('items', []) if 'h-usage-example' in item['type']
    for example in item['properties'].get('example', []))

for example in examples:
    print(example)
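
For reference, and to show why the extractor indexes into properties → example → value, the parsed structure for the markup above looks roughly like the commented sketch below. Treat it as illustrative rather than exact output; the details depend on the page and the mf2py version:

import json
import mf2py

# Approximate shape of one parsed item for the h-usage-example markup above:
# {
#     'type': ['h-usage-example'],
#     'properties': {
#         'example': [{'value': 'This is an example', 'html': '...'}],
#         'translation': [{'value': 'This is the translation', 'html': '...'}]
#     }
# }
obj = mf2py.parse(url='https://en.wiktionary.org/wiki/fazer', html_parser='html5lib')
print(json.dumps(obj.get('items', []), indent=2, ensure_ascii=False))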

Following up here today based on conversations at the colab jam: please be aware of the work the Wikidata team is doing on supporting lexicographical data in Wikidata as part of our Wiktionary support work. It will still take some time to get that finished, but it will give you nice machine-readable data for what is in Wiktionary now.

@Lydia_Pintscher I'm aware of the efforts of the Wikidata team; it is great to see that this is happening. The approach presented here is meant to be a temporary solution until we have that data. There's also the chicken-and-egg question: we first need to get the data present on Wiktionary into Wikidata. That task will be a lot easier if we already have some semantic information in the generated output, since it would let us automate the process. That's what I meant in the initial task description:

It would also be a good starting point for future Wikidata <–> Wiktionary integration work.

I'm still unsure how this aspect of the transition to Wikidata will work out in practice. What are the current ideas around this?

Just like for the rest of the data in Wikidata, editors will handle it via manual entry, bots, and other tools.

@Lydia_Pintscher OK, so making Wiktionary easier to parse right now will help with that transition. It will be great to have at least some of the data easily accessible.

In my opinion the simplest option in terms of marking up and parsing of content seems to be microformats2. They are based around prefix classes which can be mapped easily to JSON types with generic parsers available for different languages.

To test this out I added microformats2-compatible classes to usage examples rendered on Wiktionary (diff).

Thanks for the on-wiki work. This has value mostly if it is adopted (or adoptable) by multiple subdomains: have you written to the Grease pit and Wiktionary-l about this work? Since microformats2 is very generic, it would be useful to start writing down your "spec" on a Meta-Wiki page, so that other Wiktionary users can more easily comment on (and adopt) it.

(Note: since this is about wiki editing work, the main component would be WMF-General-or-Unknown once accepted by the editors.)

No, I haven't – this was just an initial test / proof of concept. To me at least it has proven useful: I can now extract usage examples quite easily from the HTML output of the templates, provided that they actually get used (Wiktionary has many cases where templates are recommended but are in fact optional).

I'll start a discussion, but it sometimes feels like a “touchy” subject: there's no clear consensus, and some editors don't see the value of semantic data, preferring fewer keystrokes or “more legible markup”, as they put it. I won't give up so easily, though, in my quest to persuade them.

Well, editors will surely have fewer concerns as long as you change the HTML output while keeping the wikitext identical.

Yes, that's the idea: editors wouldn't even notice that extra markup gets generated. However, it would also mean promoting the use of templates wherever possible, and possibly automating the conversion of non-templated content with bots.
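
As a rough illustration of what such a bot-assisted conversion could start from, here is a hedged sketch that lists usage-example lines which do not appear to go through a template. The "#:" line convention and the {{ux}}/{{usex}} template names are assumptions about current English Wiktionary practice, and a real bot would need a proper wikitext parser:

import re
import requests

# Hypothetical sketch: flag usage-example lines in a page's wikitext that do
# not use a known example template, as candidates for bot conversion.
title = 'casa'
resp = requests.get(
    'https://en.wiktionary.org/w/api.php',
    params={'action': 'parse', 'page': title, 'prop': 'wikitext', 'format': 'json'},
)
wikitext = resp.json()['parse']['wikitext']['*']

for line in wikitext.splitlines():
    if line.startswith('#:') and not re.search(r'\{\{\s*(ux|usex)\s*\|', line):
        print(line)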

However, it would also mean promoting the use of templates wherever possible, and possibly automating the conversion of non-templated content with bots.

The existing templates are widely accepted, so I think there will be a rather natural push in that direction once editors can "touch" the benefits. (It's also understandable to keep some flexibility, since making a dictionary requires involvement of many people.)

Jhernandez lowered the priority of this task from Medium to Lowest. Feb 20 2019, 4:42 PM
Jhernandez raised the priority of this task from Lowest to Low.
LGoto closed this task as Declined. Oct 9 2020, 4:50 PM