
[SPIKE] Mint section identifiers
Closed, Resolved | Public

Description

The current section topics algorithm doesn't have a way to generate section identifiers.
These are likely to be required by front-end clients to effectively retrieve a given section.
See comment in T312900#8105248.

Tasks

  • Check whether mwparserfromhell has the same behavior as the MediaWiki API for section identifiers, see example

Highlights

  • I couldn't find an obvious way to replicate the MediaWiki API behavior in mwparserfromhell (mwp)
  • mwp is also known not to handle syntax elements produced by template transclusion, see limitations. This likely means that the section transclusion issue raised in T312900#8105248 can't be easily handled

Proposal

A minimum viable solution is to use the absolute section index as a simple identifier. This would be identical to the MediaWiki API index key; see, for instance, the section object in the example call:

{
    "toclevel": 1,
    "level": "2",
    "line": "Discografia",
    "number": "3",
    "index": "16",
    "fromtitle": "Ramones",
    "byteoffset": 94346,
    "anchor": "Discografia"
}

This is the 16th section that appears in the Ramones page on itwiki, regardless of its hierarchy level.
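
As a rough sketch of how a client could turn that index key into an identifier: the helper names section_id and sections_by_id are hypothetical, and the input is assumed to be the list of section objects from an API response shaped like the one above. Any index value that is not a plain integer is mapped to None rather than guessed at.

```python
import json

# Section object copied from the example call above.
SECTION = json.loads("""
{
    "toclevel": 1,
    "level": "2",
    "line": "Discografia",
    "number": "3",
    "index": "16",
    "fromtitle": "Ramones",
    "byteoffset": 94346,
    "anchor": "Discografia"
}
""")


def section_id(section):
    """Return the absolute section index as an int, or None if it is not numeric."""
    index = section.get("index", "")
    return int(index) if index.isdigit() else None


def sections_by_id(sections):
    """Map absolute index -> section object, skipping non-numeric indexes."""
    return {section_id(s): s for s in sections if section_id(s) is not None}


lookup = sections_by_id([SECTION])
print(lookup[16]["anchor"])  # Discografia
```

Keeping the identifier as an integer makes it directly comparable with a section_id column in the dataset.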


The example dataset row given in T312900: [M] Design database model for section topics pipeline would become:

snapshot | wiki_db | page_namespace | revision_id | page_qid | page_id | page_title | section_id | section_title | topic_qid | topic_title | topic_score
2022-07-11 | enwiki | 0 | 1066420146 | Q4464287 | 10391760 | Work_(painting) | 1 | background and influences | Q543626 | Lazzaroni_(Naples) | 2

Important note

Current Research code extracts sections at hierarchy level 2 exclusively.
Given the following dummy wikitext:

section zero
== section one ==
...
=== section one one ===
...
== section two ==
...
=== section two one ===
...
==== section two one one ====
...
==== section two one two ====
...
== section three ==
...

the code will extract a total of 3 sections, each including its subsections:

[
    '== section one ==\n...\n=== section one one ===\n...',
    '== section two ==\n...\n=== section two one ===\n...\n==== section two one one ====\n...\n==== section two one two ====\n...',
    '== section three ==\n...'
]
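
To see how the level-2 extraction interacts with the absolute index proposed above, here is a minimal sketch. It is not the Research code: it uses a plain regex as a stand-in for mwparserfromhell so that it runs with the standard library only, the function name level2_sections_with_index is hypothetical, and it ignores transcluded sections, comments and nowiki blocks. It assumes the absolute index counts every heading in page order, starting at 1.

```python
import re

# A heading line: the same run of '=' on both sides of the title.
HEADING_RE = re.compile(r"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

DUMMY_WIKITEXT = "\n".join([
    "section zero",
    "== section one ==", "...",
    "=== section one one ===", "...",
    "== section two ==", "...",
    "=== section two one ===", "...",
    "==== section two one one ====", "...",
    "==== section two one two ====", "...",
    "== section three ==", "...",
]) + "\n"


def level2_sections_with_index(wikitext):
    """Return [(absolute_index, section_text)] for level-2 sections only."""
    headings = list(HEADING_RE.finditer(wikitext))
    sections = []
    for position, match in enumerate(headings, start=1):
        if len(match.group(1)) != 2:
            continue  # subsections stay inside their parent's text
        # The section ends at the next heading of level 2 or higher up.
        end = len(wikitext)
        for following in headings[position:]:
            if len(following.group(1)) <= 2:
                end = following.start()
                break
        sections.append((position, wikitext[match.start():end].rstrip("\n")))
    return sections


for index, text in level2_sections_with_index(DUMMY_WIKITEXT):
    print(index, repr(text.splitlines()[0]))
# 1 '== section one =='
# 3 '== section two =='
# 7 '== section three =='
```

Under these assumptions the three extracted sections get identifiers 1, 3 and 7 rather than 1, 2 and 3, because subsections consume index values too.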

Event Timeline

We're using mwparserfromhell. Given a wikitext, the current Research code is: mwparserfromhell.parse(wikitext).get_sections(levels=[2], include_headings=True). This extracts h2-level sections and doesn't include the lead section. We can modify the behavior, but the lead section potentially contains infobox templates and the like, so I'm not sure we want to parse that as well. What do you think?

It contains a lot of templates, but also a concise summary of the whole article, and skipping the templates should be straightforward with mwparserfromhell. Not sure how useful it would be to have topics for that - on one hand, the topic of the lead section should be the same as the topic of the whole article; conceptually, it doesn't have its own topics. But then, we don't have article-level topics so topics for section 0 could be a nice approximation for that. (Of course, in practice the lead section could be missing or poorly written and not really reflective of the article.) Not sure what use cases we have for section 0 - I assume for image suggestions we wouldn't use it as the current image suggestion pipeline already does top-level image suggestions?

Unrelated to section 0, maybe it's worth noting that MediaWiki considers raw <h2> tags in the wikitext as section titles but mwparserfromhell doesn't, so in theory they don't always agree on what sections an article has. It's very unlikely you'd encounter raw <h2> tags in wikitext though, as most wikis consider it a code style violation.
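
If it ever matters, pages where the two could disagree can be flagged with a simple scan for raw <h2> tags. This is only a sketch: raw_h2_titles is a hypothetical helper, and the regex does not skip tags inside comments or nowiki blocks.

```python
import re

# Raw HTML level-2 heading, with optional attributes, case-insensitive.
RAW_H2_RE = re.compile(r"<h2\b[^>]*>(.*?)</h2\s*>", re.IGNORECASE | re.DOTALL)


def raw_h2_titles(wikitext):
    """Return the inner text of every raw <h2>...</h2> found in the wikitext."""
    return [title.strip() for title in RAW_H2_RE.findall(wikitext)]


sample = "intro\n<h2>History</h2>\ntext\n== Legacy ==\nmore text"
print(raw_h2_titles(sample))  # ['History']
```

A non-empty result means the section list from mwparserfromhell may be shorter than the one MediaWiki reports for the same page.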

> It contains a lot of templates, but also a concise summary of the whole article, and skipping the templates should be straightforward with mwparserfromhell.

True, tracking below one way to remove templates:

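# wikitext: Wikicode object returned by mwparserfromhell.parse()
zero = wikitext.get_sections()[0]  # lead section (section 0)
for t in zero.filter_templates():
    try:
        zero.remove(t)
    except ValueError:
        pass  # nested template, already removed together with its parent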

> Not sure how useful it would be to have topics for that - on one hand, the topic of the lead section should be the same as the topic of the whole article; conceptually, it doesn't have its own topics. But then, we don't have article-level topics so topics for section 0 could be a nice approximation for that.
> Not sure what use cases we have for section 0

Section 0 topics may be a great starting point for the schema.org use case, see T302735: [SPIKE] Investigate effects of schema.org section metadata on Google search results, CC @CBogen @AUgolnikova-WMF

> I assume for image suggestions we wouldn't use it as the current image suggestion pipeline already does top-level image suggestions?

In principle yes.

> Unrelated to section 0, maybe it's worth noting that MediaWiki considers raw <h2> tags in the wikitext as section titles but mwparserfromhell doesn't, so in theory they don't always agree on what sections an article has. It's very unlikely you'd encounter raw <h2> tags in wikitext though, as most wikis consider it a code style violation.

Got it, thanks for the clarification.

mfossati changed the task status from Open to In Progress. Aug 8 2022, 2:16 PM