
[SPIKE] Investigate bad section parsing
Closed, Resolved · Public

Description

The current wikitext parser we use, i.e., mwparserfromhell, may extract sections incorrectly, which affects the final Section-Topics output and downstream use cases such as Section-Level Image Suggestions.

Bad section hierarchy

Pages with malformed wikitext can confuse the parser, which won't understand the correct section hierarchy.
For instance, in Cattivik, sections 2-4 are eaten by section 1, i.e., parsed into subsections, although they are at the same hierarchy level.
@Tgr pointed out the following: it's not that it can't find the sections in Cattivik, it just doesn't see them at the top of the AST (somewhat reasonably: the wikitext is slightly malformed, and I doubt you'll find any parser other than Parsoid that tries to simulate the PHP parser's behavior for malformed wikitext).
(specifically, the PHP parser parses «''n'n rispond' senza il mio avvocat'''» as «<i>n'n rispond' senza il mio avvocat</i>'» while mwparserfromhell seems to parse it as «<i>n'n rispond' senza il mio avvocat<b>» and then the <i> tag ends up wrapping the next few sections.)
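A quick way to see this divergence locally is to parse the offending fragment and inspect the node list mwparserfromhell builds. This is only a hedged probe: the exact node breakdown depends on the library version.

```python
# Minimal check of how mwparserfromhell handles the malformed quote markup
# from Cattivik; treat the output as a probe, not a statement of the exact parse.
import mwparserfromhell

fragment = "''n'n rispond' senza il mio avvocat'''"
for node in mwparserfromhell.parse(fragment).nodes:
    print(type(node).__name__, repr(str(node)))
```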

How big a problem is it? Proposal for how to measure it:

  • for N wikis, create a random set of X pages
  • use mwparserfromhell to split each page in the set into sections, count the sections, and store the section count for each page
  • for each page in the set call https://en.wikipedia.org/api/rest_v1/page/mobile-sections/{page_title}, count the sections and store the count
  • compare the section counts from the two methods: on what percentage of pages do the two counts disagree? (A sketch of this comparison follows the list.)
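
A minimal sketch of that comparison, assuming pages are sampled and fetched elsewhere, and assuming the mobile-sections response groups non-lead sections under remaining.sections (the exact JSON shape should be verified against a real response):

```python
import mwparserfromhell
import requests

def count_sections_mwpfh(wikitext):
    """Count non-lead sections the way mwparserfromhell splits the page."""
    code = mwparserfromhell.parse(wikitext)
    return len(code.get_sections(include_lead=False))

def count_sections_rest(host, title):
    """Count non-lead sections reported by the mobile-sections REST endpoint."""
    url = f"https://{host}/api/rest_v1/page/mobile-sections/{title}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    # Assumption: non-lead sections live under "remaining" -> "sections".
    return len(resp.json().get("remaining", {}).get("sections", []))

def disagrees(wikitext, host, title):
    """True when the two section counts differ for the same page."""
    return count_sections_mwpfh(wikitext) != count_sections_rest(host, title)
```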

Once we know how big the problem is, we can decide whether we need to fix it. If we do, here are some potential ways to go about it:

  • @dr0ptp4kt pointed to Parsoid-enabled endpoints in https://en.wikipedia.org/api/rest_v1, such as /page/mobile-sections/{title}. This needs further investigation, especially to understand whether we can use them as a library. REST endpoints don't seem like a viable solution, as they would impact the data processing execution time
  • @Tgr suggested that we can get good results with mwparserfromhell by iterating through all nodes, noting when we find second-level headings, and then getting the wikitext between them (see the sketch after this list). This requires changing the section extraction logic, and may be a medium effort with a high gain
  • Investigate other available parsers for Python
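
One possible reading of @Tgr's suggestion, sketched with mwparserfromhell only: collect level-2 headings wherever they appear in the parse tree, then slice the raw wikitext between them. Splitting by string position is an assumption of this sketch, not the pipeline's current logic.

```python
import mwparserfromhell

def split_on_level2_headings(wikitext):
    """Split raw wikitext at level-2 headings found anywhere in the parse tree."""
    code = mwparserfromhell.parse(wikitext)
    # ifilter_headings() recurses by default, so headings swallowed inside a
    # stray ''...'' tag (as in Cattivik) are still visited.
    level2 = [h for h in code.ifilter_headings() if h.level == 2]
    spans, cursor = [], 0
    for heading in level2:
        raw = str(heading)
        start = wikitext.find(raw, cursor)
        if start == -1:
            continue  # heading not found verbatim; skip rather than guess
        spans.append((start, start + len(raw), str(heading.title).strip()))
        cursor = start + len(raw)
    sections = []
    for i, (start, end, title) in enumerate(spans):
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(wikitext)
        sections.append((title, wikitext[end:next_start]))
    return sections
```

This sidesteps the nesting problem because the heading filter recurses into tags, but it would still miss "exotic" pages that don't use standard section headings at all (see the conclusions below).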

Note that infobox blue links can also end up associated with the wrong section. See for instance the following algorithm output row:

+-------+------------------+--------------------+-------+--------------------+--------------------+----------+----+
|wiki_db|page_wikidata_item|          page_title|page_id|     section_heading|section_link_item_id|commons_id| pid|
+-------+------------------+--------------------+-------+--------------------+--------------------+----------+----+
| ptwiki|               Q51|           Antártida|    396|           toponímia|               Q3960|   8727360|p373|
+-------+------------------+--------------------+-------+--------------------+--------------------+----------+----+

This may be a related issue that requires further digging.

Reports

We implemented the proposal and compared mwparserfromhell with the Wikimedia REST and Action APIs.

NOTE: all samples are random, so they differ from each other.

Wikimedia REST API

Example call: https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Joe_Strummer/1116897025

100 pages per wiki
500 pages per wiki
  • sample 1:
  • sample 2:

Action API

Example call: https://en.wikipedia.org/w/api.php?action=parse&oldid=1116897025&prop=sections&format=json
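
For reference, a hedged sketch of the same count via the Action API call above (formatversion=2 is added here only for a cleaner JSON shape; error handling and throttling are omitted):

```python
import requests

def count_sections_action_api(host, oldid):
    """Count sections reported by action=parse&prop=sections for a revision."""
    params = {
        "action": "parse",
        "oldid": oldid,
        "prop": "sections",
        "format": "json",
        "formatversion": 2,
    }
    resp = requests.get(f"https://{host}/w/api.php", params=params, timeout=30)
    resp.raise_for_status()
    return len(resp.json()["parse"]["sections"])
```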

100 pages per wiki
  • sample A:
    • 🐞 mean percentage: 9.7%, see P38058
    • per wiki ratio: P38057
  • sample B:
    • 🐞 mean percentage: 10.7%, see P38060
    • per wiki ratio: P38059
500 pages per wiki

Notes

We dug into sample 1 of the Wikimedia REST API 500-pages-per-wiki run:

  • total = 133,221 pages
  • disagreement = 2,610 pages
  • ratio = 0.019 (~2%)
  • the API yielded 687 sections with no TOC level. This biases the check, so the ratio is an upper bound on bad parsing
  • 1,498 pages have fewer sections with mwparserfromhell
  • 1,112 pages have more sections with mwparserfromhell
  • disagreement details: P37933
  • 👽 in P37933 there are
    • 45 pages from the 10 Wikipedias with the most article pages - source
    • 69 from the top-20
    • 95 from the top-30

REST vs. Action API

In sample 1 we found a subset of pages where the Wikimedia REST API disagrees with the Action API:

  • disagreement = 5,261 pages
  • ratio = 0.039 (~4%)
  • disagreement details: P38790

This disagreement decreases slightly once we take sections with missing TOC levels into account:

  • the Action API returned TOC levels for all sections
  • disagreement = 4,868 pages
  • ratio = 0.0365 (~3.7%)
  • disagreement details: P38795

Among 16 random pages in P38795, there are exotic cases where it's hard to tell which API is right:

Conclusion

Based on notes marked with an emoji, here are our conclusions:

  • 👽 we suspect that most errors come from the long tail of small Wikipedias
  • 🐴 exotic pages without standard section headers would break @Tgr's solution
  • 🐴 if the Action API is more reliable than the REST one, then 🐞 an average of 10% bad parsing is not acceptable per se
  • 👽 however, if that 10% mainly comes from the long tail of exotic pages in small Wikipedias, then I think we should make a product decision, CC @AUgolnikova-WMF
  • if we decide to fix this, we’ll have to figure out the most effective change to our section extraction logic. Finding the level-2 headings will still not work for those exotic pages, so it will be a matter of traversing the wikitext tree correctly

Next steps

According to P38061, 36 out of 305 Wikipedias have more than 10% of badly parsed pages, while the rate is tolerable for the vast majority. See P40228.

We will pre-compute all pages with bad parsing and filter them out for problematic wikis, starting with ptwiki in T323489: Pre-compute bad section parsing for ptwiki.
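
A rough sketch of what that filtering step could look like in PySpark (the algorithm output above looks like a Spark DataFrame); the table and column names here are illustrative assumptions, not the actual pipeline's:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table names: the algorithm output and the precomputed set of
# (wiki_db, page_id) pairs whose section counts disagree.
suggestions = spark.table("section_topics_output")
bad_pages = spark.table("bad_section_parsing")

clean = suggestions.join(
    bad_pages.select("wiki_db", "page_id"),
    on=["wiki_db", "page_id"],
    how="left_anti",  # keep only pages whose parsing we trust
)
```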

The need for custom parsing solutions will be tackled if we have to increase the volume of section-level image suggestions.

Event Timeline

There are Enterprise HTML dumps, not sure how stable those are. Using the actual HTML probably is the most robust method, and I don't think it would be terrible performance-wise, as DOM traversal is a well-optimized thing. The only Parsoid client is in JS, but I don't think it's too hard to hand-code the logic - it's just DOM traversal based on the typeof attributes of Parsoid-generated elements and a few other things. Creating a Parsoid client in Python would probably be a worthwhile contribution to the research space on its own.
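
For the record, a hedged sketch of what that DOM traversal could look like in Python over Parsoid HTML (e.g. from the Enterprise HTML dumps), assuming sections are wrapped in <section data-mw-section-id> elements; attribute names should be checked against real Parsoid output.

```python
# Sketch only: walk Parsoid-style HTML and collect top-level sections.
from lxml import html

def sections_from_parsoid_html(parsoid_html):
    """Return (section_id, heading_text) pairs by walking the Parsoid DOM."""
    doc = html.fromstring(parsoid_html)
    results = []
    for section in doc.iter("section"):
        section_id = section.get("data-mw-section-id")  # assumed attribute name
        heading = section.find("./h2")  # top-level headings only, for this sketch
        if section_id is not None and heading is not None:
            results.append((section_id, heading.text_content().strip()))
    return results
```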

The scope of this spike is to understand the extent of the problem, propose whether it needs to be fixed or can be ignored, and, if it needs to be fixed, propose a potential solution. Solving the problem is *not* in scope of this ticket.

Cparle renamed this task from [SPIKE] Find a way to avoid bad section parsing to [SPIKE] Investigate bad section parsing. Oct 26 2022, 1:58 PM
Cparle updated the task description.
mfossati changed the task status from Open to In Progress. Oct 31 2022, 3:49 PM
mfossati claimed this task.
mfossati updated the task description.

Closing, decisions made. See Next steps in the task description.