
[EPIC] Representing / extracting wiki-specific application-level semantics
Open, Medium, Public

Description

@ssastry wrote:

TL;DR:

  1. Applications need to identify wiki-specific semantics as encapsulated in the template namespace and in how templates are used in articles.
  2. Parsoid cannot provide that information as it exists today --- well, it can in a crude sort of way, by adding an analysis layer that upstreams all the custom analyses / hacks that various applications are currently doing on their own. There is some benefit to that: a centralized location where such analyses can live, improve, and be built upon without every application inventing its own set of tools each time.
  3. Shadow namespaces / Global namespaces, TemplateData, in-template annotations, future-proposed-types, wikidata are some existing / proposed mechanisms that seem to solve some subset of problems.
  4. There is also an orthogonal axis of approaches: explicit typing / annotation vs. type inference / machine learning for representing / extracting some of these semantics from the existing corpus.

Thought dump:

Parsoid attempts to expose wikitext semantics in the HTML it generates by marking up templates (and exposing template args), extension output (references, galleries, etc.), links, and so on. But, at this level, the semantics are pretty low-level and only useful up to a point.
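
To make the "low-level" point concrete, the markup Parsoid emits for a transclusion looks roughly like this (illustrative and simplified; the template name and parameter here are made up):

<table about="#mwt1" typeof="mw:Transclusion"
    data-mw='{"parts":[{"template":{"target":{"wt":"Infobox person","href":"./Template:Infobox_person"},
    "params":{"name":{"wt":"Ada Lovelace"}},"i":0}}]}'>
    ...
</table>

The template name and its arguments are exposed, but nothing says "this is an infobox"; that interpretation is left entirely to the consumer.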

From what I have noticed over the last couple of years, applications that need to analyze wikitext need semantic information at an application level that is not directly available in the low-level semantic markup that Parsoid provides. It seems like a layer that has to be built on top.

In the past, Google asked us about this wrt templates. Content Translation encountered this when they had to adapt templates across wikis because of translations and had asked us if there is any way Parsoid could provide assistance there. Most recently, Reading ran into this problem when they were trying to do some transformations of Parsoid content for display in a mobile context (and I warned them that some of their transformations and analyses are very wiki-specific and won't generalize). And, yesterday, in a conversation with Leila, I learnt about https://meta.wikimedia.org/wiki/Research:Expanding_Wikipedia_stubs_across_languages#What_is_a_stub.3F ... and reading that, I see a similar problem that they run into wrt figuring out which templates capture the notion of a stub in a way that works across wikis. I vaguely have the sense that something similar had popped up in some conversations with Aaron.

By adding <section> wrappers to Parsoid markup (which Mobile does in their layer right now but will remove once Parsoid's implementation goes live), Parsoid is moving a tiny step beyond that, but only marginally because the notion of a section is a wikitext-centric notion that is wiki-agnostic. A generalization of this problem came up in the context of T105845: RFC: Page components / content widgets but my general observation about that is captured in T105845#1650089 .. i.e. I think this is not a Parsoid specific problem, but a problem that is one layer above Parsoid.

The TL;DR: of this prologue is that what the above applications are looking for is a way of capturing wiki-specific / domain-specific semantics that are represented in templates (their names, their categorization, how they are used and are expected to be used, in combination with which other templates, etc.). Shadow templates / global template namespaces go one step in this direction and can solve the needs of some applications (or maybe go a long way).

But, it seems to me that there is a point here where it might be worthwhile thinking of this problem a bit more generically, i.e. is there a separate mechanism / meta layer that can help capture this kind of knowledge / information and address the needs of multiple applications? TemplateData is one such mechanism, for example, wherein editors provide additional information to editing tools (VE, Parsoid). In the context of T105845 (the same ticket referenced earlier), we pondered (ab)using TemplateData for it, but it felt a bit hackish to me.

Separately, on the Parsing team, as we get closer to implementing well-formed output ("balanced") semantics for templates, we are considering other meta-level tagging information via in-source annotations in templates. As someone who is interested in languages, I think of that problem more generally as introducing rich type annotations for templates that convey not just well-formedness, but also user-specific / custom metadata and semantics (including things like custom JS resources and custom CSS resources like template styles) that can be captured / represented separately.

And, there is wikidata, of course, which solves problems like inter-language links pretty elegantly. How generalizable is that solution? i.e. can we represent wiki-specific concepts (stub, infobox, etc.) via wikidata items, and extract additional semantics by analyzing relationships between items?

But, there is another way of looking at this problem entirely (from the static types / automatic type inference lens OR the machine learning / AI lens), i.e. instead of having explicit annotations (types, templatedata, template annotations, DSLs, whatever), another way of doing the same would be to extract features from analyzing the corpus.

Questions:

Okay, so, that is an off-the-cuff survey of the problem space based on disjoint conversations that I have had with folks over the years. Firstly, I wanted to sound out whether there is any resonance with what I am laying out here. If so, secondly, does it make sense to think about this problem space a bit more systematically to see what problems can be addressed with what approaches? For example, I don't think there is a single generic approach that will work for all scenarios, nor is this likely to eliminate the need for custom application-specific analyses. Given that, thirdly, if there is any coherence to be found in this problem space, what set of approaches and mechanisms do we want to consider and build so that they enhance each other's value collectively, vs. building independent solutions that might work at cross-purposes?

These are pretty nascent thoughts now, but I figured I would sound this out amongst those of you who I know have grappled with different pieces at different times in your work, since I think there is something here worth further exploration. Feel free to fwd to anyone who might have more insight into this since I don't necessarily have solutions here, mostly questions. :-)

@Fjalapeno replied:

Thanks for writing this up…

I can say that, yes, we have been looking for ways to reliably make sense of semantics of the content in pages for quite some time. If templates didn’t exist, we could probably just analyze the wikitext itself. But with all the transcluded content, we are basically left to writing heuristics that process the output HTML. This is less than ideal, but we could probably get away with this if templates were consistent across projects.

As you noted, for the Reading clients, we have essentially been migrating our heuristics out of the client apps and into middleware services (MCS). While the techniques we use are sometimes brittle, having them centralized and on the server allows us to react quickly to problems and solve them for all clients at once.

As far as a solution, I am not sure where to go. I may be naive here, but part of me thinks this is a community problem more than a tech problem: standardizing MediaWiki installs and templates across projects would go a long way here and allow us to write heuristics that always work (I think). Long term, heuristics are obviously not a great solution… so being able to access this information as structured data would be the way to go. How that is accomplished I don’t know, but I would be very interested in helping to find out.

@ssastry replied:

If templates didn’t exist, we could probably just analyze the wikitext itself. But with all the transcluded content, we are basically left to writing heuristics that process the output HTML. This is less than ideal, but we could probably get away with this if templates were consistent across projects.

Templates enable you to systematically represent structured content in pages. Without templates, how would you know which blob of html on a page is an infobox, or a navbox, or a data table, or chemical element information, or article quality info (stub, qa, fa, etc.), or curation information (draft, citation needed, etc.), and so on? Templates and their specific patterns of usage expose higher-level concepts about a page (infobox), a project (all chemical element pages use templates X, Y, Z), or a wiki (all pages on this wiki that meet certain quality standards are flagged with template FA to indicate featured-article status).
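
As a concrete (and hypothetical) illustration, the wikitext of a chemistry article might contain transclusions like the following, and it is only the template names and local usage conventions that tell you the first one is an infobox and the second is quality/curation metadata:

{{Infobox element
| name   = Oxygen
| symbol = O
| number = 8
}}

... article prose ...

{{stub}}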

Guillaume started a project to classify templates (see https://meta.wikimedia.org/wiki/Templates/Taxonomy and https://meta.wikimedia.org/wiki/Templates/Taxonomy/Most_transcluded_templates/wikipedia/en/2015-01-05 for example). One of the questions I am asking is how do you represent and/or extract this (and other kinds of) higher-level conceptual information out of the system of templates in use?

As you noted, for the Reading clients, we have essentially been migrating our heuristics out of the client apps and into middleware services (MCS). While the techniques we use are sometimes brittle, having them centralized and on the server allows us to react quickly to problems and solve them for all clients at once.

Makes sense.

As far as a solution, I am not sure where to go. I may be naive here, but part of me thinks this is a community problem more than a tech problem: standardizing MediaWiki installs and templates across projects would go a long way here and allow us to write heuristics that always work (I think). Long term, heuristics are obviously not a great solution… so being able to access this information as structured data would be the way to go. How that is accomplished I don’t know, but I would be very interested in helping to find out.

Consistency (in the sense of using the same / similar templates) is unlikely since different wikis have different ways of doing things (and I think that is quite reasonable), but global / shadow namespaces could at least do away with unnecessary / trivial variations.

There are different solutions / strategies for different problems ... and I go back to my earlier question of which of these need to be worked on together a bit more coherently: annotations for balanced templates, templatedata annotations for edit tooling, or some sort of generic typing mechanism which can be a container for annotations, resources, handlers, and higher-level wiki-specific concepts in one place (maybe an expanded / refined notion of templatedata is such a mechanism).

@GWicke replied:

Thanks for starting this conversation.

I think we all would like to make it easier to identify and work with semantically important content elements like infoboxes, navboxes, reference lists, pronunciation guides, and so on. Currently, Parsoid reports relatively low-level information on template or extension names and parameters, and leaves it to individual clients to handle the complexity of cross-project variations. This made sense for the initial focus on editing, but we can and should evolve this now that Parsoid HTML is used increasingly for reading use cases and analytics.

In the recent design conversation about reading-optimized HTML, I proposed to try marking up such semantic elements in MCS, and handling destructive processing (such as removal of some of those elements for mobile use cases) in a separate layer, based on the added semantic markup. The idea is that we can learn a lot about what it takes to mark up specific semantic elements across projects, and eventually migrate some or all of those transforms into the standard HTML we expose, display, and edit.

There are many questions about how we can generalize this to many projects, how we can make sure that we cover all important use cases, and how things should be implemented in detail. You already bring up many options for the multi-project problem. Several options boil down to maintaining a mapping of template names (and possibly parameters) to semantic elements. Another approach is to ask template authors to add specific markup (such as classes / microformats) in their template output. The latter approach was proposed by community members for wiktionary: T138709: Use microformats on Wiktionary to improve term parsing. Either way, such mappings should probably be maintained by the community, either by editing a mapping / template metadata, or by editing the templates themselves.
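
To illustrate the first option, such a community-maintained mapping could be as simple as a per-wiki table from template names to semantic element types (entirely hypothetical; the names and the storage format here are placeholders):

{
    "enwiki": { "Infobox person": "infobox", "Stub": "stub", "Hatnote": "hatnote" },
    "eswiki": { "Ficha de persona": "infobox", "Esbozo": "stub" }
}

Whether such a table lives in template metadata, on a central wiki, or in the templates themselves is exactly the maintenance question raised above.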

The second implementation question is where the classification should be handled: Directly in Parsoid, or in a separate service like MCS that is layered on top. I think either can work, but given that we would like to use the same HTML for editing, it might be more reliable to perform and more importantly test this in Parsoid itself.

@cscott replied:

It seems to me that there are two features here: one -- encoding the semantic information latent in template inclusions (or whatever), and two -- transporting this across languages and projects.

Some architectural solutions would (potentially) provide one as a byproduct of the other.

For example, the idea of global / shadow namespaces (T66475, T91162), etc., would allow all of the semantic information to be handled as (for example) inclusion from the global namespace. In theory, all wikis would now use the global Template:Infobox and Template:Stub, and so we could identify them easily. On eswiki, perhaps they use [[Template:Esbozo]], but that includes/references the global [[Template:Stub]], and so our identification/transport still works. This relies on convincing the various local communities to adopt the global versions of their favorite templates.
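
A hedged sketch of what that could look like in wikitext (the global: prefix and the parameter are invented purely for illustration; the actual resolution mechanism proposed in T66475 / T91162 may differ):

<!-- [[Plantilla:Esbozo]] on eswiki: a thin local wrapper around the shared template -->
{{global:Stub
| text = This article is a stub. You can help by expanding it.
}}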

The strawman for semantic image styles in T90914: Provide semantic wiki-configurable styles for media display used a similar identification/transport mechanism. [[Mediawiki:ImageStyle-largethumb]] would live on the global wiki, and then any localization on other wikis would still convey the same semantic information so long as it inherited from the global largethumb style. Indeed, the notion of style "inheritance" was included in the strawman specifically to allow semantic transport across localization to work.

There are probably other ways to accomplish these that don't rely on inheritance from a global template conveying semantic meaning. Just as a strawman, you could create a new parser function, {{#semantictag:foo}}. Then your infobox template on eswiki could be named anything it pleases and be implemented in whatever way it liked, so long as it included the invisible {{#semantictag:infobox}} somewhere inside it. This encodes semantic information and transports the semantic tags across projects without relying on global templates / shadow namespaces / inheritance / etc. Templates could be written for image inclusion which also embed appropriate semantic tags, e.g. {{largethumb|Foo.jpg|caption}} could expand to [[Foo.jpg|300px|caption{{#semantictag:largethumb}}]]. This has the benefit that semantic tags could more easily be added to various projects w/o convincing them to standardize on global styles/templates/etc -- however, the value will be somewhat less because the semantic tag doesn't guarantee that the user of the template actually provides the semantics in an understandable/portable way. You might need to expand {{#semantictag}} to provide argument mappings, for instance, while these mappings "come free" when you require transport to be via inclusion of a global template with standardized arguments.
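
A minimal sketch of the strawman (the parser function does not exist; everything here, including the names and class, is hypothetical):

<!-- [[Plantilla:Esbozo]] on eswiki, implemented however the local community likes -->
{{#semantictag:stub}}
<div class="esbozo-notice">This article is a stub; you can help by expanding it.</div>

A consumer would then look for the semantic tag in the expanded output (or in metadata derived from it) instead of matching template names per wiki.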

Anyway, thanks subbu for starting this discussion!

@ssastry replied:

I finally got around to this over many days and have started filling up https://www.mediawiki.org/wiki/Parsing/Notes/Wikitext_2.0/Typed_Templates

Event Timeline

ssastry updated the task description.

Some comments...

@ssastry:

The TL;DR: of this prologue is that what the above applications are looking for is a way of capturing wiki-specific / domain-specific semantics that are represented in templates (their names, their categorization, how they are used and are expected to be used, in combination with which other templates, etc.).

Indeed. In particular, how they are used and are expected to be used is a frequently-forgotten, but important element. For example:

  • Is a template supposed to be used on article talk pages? On articles? On user talk pages? On project discussion pages? In other templates? Elsewhere? (see T58516, for example)
  • What should be done about templates when they are translated? (T165053)

These are fascinating questions, and I suspect that they were never properly researched.

But, it seems to me that there is a point here where it might be worthwhile thinking of this problem a bit more generically, i.e. is there a separate mechanism / meta layer that can help capture this kind of knowledge / information and address the needs of multiple applications?

Indeed, again. It's important to define templates as applications. Or even as products (without a designated product manager). Or as features that serve particular needs of users (otherwise they wouldn't be created). And to move from thinking about their implementation in wiki syntax or Lua to thinking about what they actually do, how they are actually used, and what is the best way to implement them. It may be wiki syntax, or Lua, or a PHP/JS extension, or a service, or something else.

@Fjalapeno:

I can say that, yes, we have been looking for ways to reliably make sense of semantics of the content in pages for quite some time. If templates didn’t exist, we could probably just analyze the wikitext itself. But with all the transcluded content, we are basically left to writing heuristics that process the output HTML. This is less than ideal, but we could probably get away with this if templates were consistent across projects.

All templates have semantics, and these semantics are usually clear to people who use them—experienced wiki editors.

They are not clear to new wiki editors. Nobody knows that they need to type {{some template name}} when they start editing any MediaWiki wiki. And when VisualEditor shows a template insertion dialog, it doesn't show a template selector; a template name must be known and manually typed (see T55590).

They are also not clear to software, unless software is taught about it specifically. Structured discussions (Flow) knows about mention templates. Twinkle knows about Articles-for-deletion-related templates (only in the English Wikipedia and maybe some other projects to which it was manually adapted).

As you noted, for the Reading clients, we have essentially been migrating our heuristics out of the client apps and into middleware services (MCS). While the techniques we use are sometimes brittle, having them centralized and on the server allows us to react quickly to problems and solve them for all clients at once.

Yes, this is another example of software that was "taught" about the semantics of a particular template. Wouldn't it be much better if hatnotes and infoboxes were real extensions that mobile web and apps could identify precisely in any wiki without brittle heuristics?

We can be bolder and think about learning at least some of the on-wiki processes for which templates are used and replacing them with properly managed features.

Thanks for bringing together the various strands of discussion here. I finally managed to read through the various proposals linked in this task; there are quite a few (balanced / typed templates, template data, wikitext 2.0).

The benefits are clear and I like the idea of attaching more information to templates; this should probably be part of the current template data extension. Having yet another mechanism to attach metadata would be very confusing.
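
As a sketch of what that could look like, reusing the existing <templatedata> block on a template's documentation page (the "semanticType" field below is not part of the current TemplateData schema; it only illustrates the idea):

<templatedata>
{
    "description": "Infobox for a person",
    "semanticType": "infobox",
    "params": {
        "name": {
            "label": "Name",
            "type": "string",
            "required": true
        }
    }
}
</templatedata>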

Another approach is to ask template authors to add specific markup (such as classes / microformats) in their template output. The latter approach was proposed by community members for wiktionary: T138709: Use microformats on Wiktionary to improve term parsing. Either way, such mappings should probably be maintained by the community, either by editing a mapping / template metadata, or by editing the templates themselves.

I'd like to continue this discussion. Especially with the coming wiktionary-wikidata integration it seems crucial to me to have as much wiktionary content as possible marked up semantically in order to help with the migration.

After playing around a bit with the semantic html rendered by parsoid I was wondering if standardizing on RDFa for embedding wiki-specifics would be a good idea.

All the semantic information could be extracted in one step; there wouldn't be a need for yet another parser. And some of the vocabularies can probably be shared across wikis.

Usage examples on wiktionary are currently marked up with microformats2:

<div class="h-usage-example">
    <i class="Latn mention e-example" lang="fr" xml:lang="fr">
    cousu de fil blanc
    </i> ―
    <span class="e-translation">blatantly obvious</span>
    (literally, “<span class="e-literally">sewn with white thread</span>”)
</div>

With RDFa it could look like this:

<div vocab="http://en.wiktionary.org/rdf" typeof="UsageExample">
    <i class="Latn mention" lang="fr" xml:lang="fr" property="example">
    cousu de fil blanc
    </i> ―
    <span property="translation">blatantly obvious</span>
    (literally, “<span property="literally">sewn with white thread</span>”)
</div>

It's not that much different. I guess we would have to explicitly specify a vocabulary, but that would be good for documentation purposes anyway. Would the RDFa-specific attributes also show up in the "normal" HTML of the PHP parser?

The type of semantic content (vocabulary) included in the template could also be added to the template metadata. It would be easy to know which templates generate semantic output.
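
For example (again as a hypothetical extension of TemplateData, not something that exists today), the usage-example template above could declare the vocabulary and type of its output alongside its parameters:

<templatedata>
{
    "description": "Formats a usage example on Wiktionary",
    "semanticOutput": {
        "vocab": "http://en.wiktionary.org/rdf",
        "typeof": "UsageExample"
    },
    "params": {}
}
</templatedata>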