
Use microformats on Wiktionary to improve term parsing
Closed, Declined · Public

Description

(task opened as a follow-up to a conversation with @GWicke at Wikimania)

There is an experimental definition term endpoint deployed which makes some assumptions about the HTML structure of the Wiktionary markup (cf. parseDefinition.js).

To avoid future maintenance problems and to scale up the extraction to other languages and Wiktionaries, it would be helpful to have the Wiktionary templates already include semantic information in their output. On Wikisource this has already been done successfully (in the context of ebook metadata export).

It would also be a good starting point for future Wikidata <–> Wiktionary integration work. There is already some minimal semantic information included in a few places, for example gender markers (cf. the Spanish entry casa):

<span class="gender">
  <abbr title="feminine gender">f</abbr>
</span>
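
Even this small bit of markup is already machine-consumable without relying on page structure. Below is a minimal sketch of extracting it, assuming requests and BeautifulSoup, and assuming the span/abbr structure shown above (which may not be present on every entry):

import requests
from bs4 import BeautifulSoup

# Minimal sketch: pull gender markers out of the rendered HTML of an entry.
# Assumes the <span class="gender"><abbr title="..."> structure shown above.
html = requests.get('https://en.wiktionary.org/wiki/casa').text
soup = BeautifulSoup(html, 'html.parser')

for abbr in soup.select('span.gender abbr[title]'):
    # e.g. "f" -> "feminine gender"
    print(abbr.get_text(strip=True), '->', abbr['title'])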

A first step would be to formalize these conventions and implement them consistently on the English Wiktionary, then adapt the parsing code on the API side.

Thoughts?

Event Timeline

As I wrote by email a few minutes ago, I really like the idea of Wiktionary including more semantic markup. Without it, of course, efforts like the current content service endpoint are necessarily pretty brittle and not at all scalable to other languages. Also, I wrote the existing endpoint to parse just the information needed to match the product spec created by the design team (see T114949), but of course being able to expose the entire page content in a structured way would be much more useful.

I'm going to tag the Wikipedia app as well for tracking purposes. User engagement with the Wiktionary definition popup feature has been limited so far, but we're interested in seeing how we can increase it, and along with improving discoverability, expanding it beyond English would surely help.

@Jhernandez Are Wiktionary definition popups something the web team would be interested in, if we can get the endpoint into a more robust and scalable state?

@Mholloway I'm not really sure; why are you asking?

Sounds like a good idea to me, but I'm no PO :p

Fair 'nuff. :) Just thought of you because of your interest in node services. We don't really keep such a strict engineer/PO division of labor over here in Android-land.

@Mholloway, could you document which elements you would like to see marked up? I know this is somewhat implicit in the current extraction logic, but I think it would help move the discussion forward to have a concrete list that we could discuss with the community.

Sure, I'll do some thinking and put together a list.

It seems like adding microformats to identify
- language headers,
- part-of-speech headers,
- definitions, and
- examples
would be a good start.

Could we add class="wd-header", class="wd-part-of-speech" to the relevant header tags, and class="wd-definition" and class="wd-example" to the relevant <li> tags?

Edit: to incorporate @GWicke's suggestion below.
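
To make the proposal concrete, here is a rough sketch of how the API side could consume such classes. It assumes the proposed wd-* class names were actually present in the rendered HTML, which is not the case today; the selectors and the entry used are purely illustrative:

import requests
from bs4 import BeautifulSoup

# Illustrative only: assumes the rendered Wiktionary HTML carries the
# proposed wd-* classes (it does not yet).
html = requests.get('https://en.wiktionary.org/wiki/casa').text
soup = BeautifulSoup(html, 'html.parser')

languages = [h.get_text(strip=True) for h in soup.select('.wd-header')]
parts_of_speech = [h.get_text(strip=True) for h in soup.select('.wd-part-of-speech')]
definitions = [li.get_text(strip=True) for li in soup.select('li.wd-definition')]
examples = [li.get_text(strip=True) for li in soup.select('li.wd-example')]

print(languages, parts_of_speech, definitions, examples)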

@Mholloway: Those class names are very general, which makes conflicts more likely. Perhaps consider adding a prefix, such as "wd-" (wiktionary definition)?

Good idea. I updated my comment above to reflect it.

Niedzielski subscribed.

This kind of sounds like technical debt. Please drop the tag if I'm mistaken.

I finally managed to get some time to work on this and also did some research on microformats. In the last few years this area has become increasingly confusing, with a variety of options (microformats1/2, W3C microdata, schema.org, RDFa (Lite), JSON-LD, etc.).

In my opinion the simplest option in terms of marking up and parsing of content seems to be microformats2. They are based around prefix classes which can be mapped easily to JSON types with generic parsers available for different languages.

To test this out I added microformats2-compatible classes to usage examples rendered on Wiktionary (diff).

<div class="h-usage-example">
  <span class="e-example">This is an example</span>
  <span class="e-translation">This is the translation</span>
</div>

This allows one to write a simple extractor in a few lines of Python, with the help of the mf2py parser:

import mf2py

# Parse the rendered page; html5lib copes well with real-world HTML.
obj = mf2py.parse(url='https://en.wiktionary.org/wiki/fazer', html_parser='html5lib')

# Collect the text value of every e-example property on h-usage-example items.
examples = (example['value']
    for item in obj.get('items', []) if 'h-usage-example' in item['type']
    for example in item['properties'].get('example', []))

for example in examples:
    print(example)
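
For reference, and to show why the extractor indexes into properties → example → value, the parsed structure for the markup above looks roughly like the commented sketch below. Treat it as illustrative rather than exact output; the details depend on the page and the mf2py version:

import json
import mf2py

# Approximate shape of one parsed item for the h-usage-example markup above:
# {
#     'type': ['h-usage-example'],
#     'properties': {
#         'example': [{'value': 'This is an example', 'html': '...'}],
#         'translation': [{'value': 'This is the translation', 'html': '...'}]
#     }
# }
obj = mf2py.parse(url='https://en.wiktionary.org/wiki/fazer', html_parser='html5lib')
print(json.dumps(obj.get('items', []), indent=2, ensure_ascii=False))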

Following up here today based on conversations at the colab jam: please be aware of the work the Wikidata team is doing on supporting lexicographical data in Wikidata as part of our Wiktionary support work. It will still take some time to get that finished, but it will give you nice machine-readable data for what is in Wiktionary now.

@Lydia_Pintscher I'm aware of the efforts of the Wikidata team; it is great to see that this is happening. The approach presented here is meant to be a temporary solution until we have that data. There's also the chicken-and-egg question: we first need to get the data present on Wiktionary into Wikidata. That task will be a lot easier if we already have some semantic information in the generated output, since it would let us automate the process. That's what I meant in the initial task description:

It would also be a good starting point for future Wikidata <–> Wiktionary integration work.

I'm still unsure how this aspect of the transition to Wikidata will work out in practice. What are the current ideas around this?

Just like for the rest of the data in Wikidata, editors will handle it via manual entry, bots, and other tools.

@Lydia_Pintscher OK, so making Wiktionary easier to parse right now will help with that transition. It will be great to have at least some of the data easily accessible.

In my opinion the simplest option in terms of marking up and parsing of content seems to be microformats2. They are based around prefix classes which can be mapped easily to JSON types with generic parsers available for different languages.

To test this out I added microformats2-compatible classes to usage examples rendered on Wiktionary (diff).

Thanks for the on-wiki work. This has value mostly if it is adopted (or adoptable) by multiple subdomains: have you written to the Grease pit and Wiktionary-l about this work? Since microformats2 is very generic, it would be useful to start writing down your "spec" on a Meta-Wiki page, so that other Wiktionary users can more easily comment on (and adopt) it.

(Note: since this is about wiki editing work, the main component would be WMF-General-or-Unknown once accepted by the editors.)

No, I haven't – this was just an initial test / proof of concept. To me at least it has proven useful: I can now extract usage examples quite easily from the HTML output of the templates, provided that they actually get used (Wiktionary has many cases where templates are recommended but are in fact optional).

I'll start a discussion, but it sometimes feels like a “touchy” subject: there's no clear consensus, and some editors don't see the value of semantic data, preferring fewer keystrokes or “more legible markup”, as they put it. I won't give up so easily, though, in my quest to persuade them.

Well, editors will surely have fewer concerns as long as you change the HTML output while keeping the wikitext identical.

Yes, that's the idea: editors wouldn't even notice that extra markup gets generated. However, it would also mean promoting the use of templates wherever possible, and possibly automating the conversion of non-templated content with bots.
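
As a rough illustration of what such a bot-assisted conversion could start from, here is a hedged sketch that lists usage-example lines which do not appear to go through a template. The "#:" line convention and the {{ux}}/{{usex}} template names are assumptions about current English Wiktionary practice, and a real bot would need a proper wikitext parser:

import re
import requests

# Hypothetical sketch: flag usage-example lines in a page's wikitext that do
# not use a known example template, as candidates for bot conversion.
title = 'casa'
resp = requests.get(
    'https://en.wiktionary.org/w/api.php',
    params={'action': 'parse', 'page': title, 'prop': 'wikitext', 'format': 'json'},
)
wikitext = resp.json()['parse']['wikitext']['*']

for line in wikitext.splitlines():
    if line.startswith('#:') and not re.search(r'\{\{\s*(ux|usex)\s*\|', line):
        print(line)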

However, it would also mean promoting the use of templates wherever possible, and possibly automating the conversion of non-templated content with bots.

The existing templates are widely accepted, so I think there will be a rather natural push in that direction once editors can "touch" the benefits. (It's also understandable to keep some flexibility, since making a dictionary requires involvement of many people.)

Jhernandez lowered the priority of this task from Medium to Lowest. Feb 20 2019, 4:42 PM
Jhernandez raised the priority of this task from Lowest to Low.
LGoto closed this task as Declined. Oct 9 2020, 4:50 PM