The wikitext parser we currently use, mwparserfromhell, can extract sections incorrectly, impacting the final Section-Topics output and downstream use cases such as Section-Level-Image-Suggestions.
Bad section hierarchy
Pages with malformed wikitext can confuse the parser, which won't understand the correct section hierarchy.
For instance, in Cattivik, sections 2-4 are swallowed by section 1, i.e., parsed as its subsections, even though they sit at the same hierarchy level.
@Tgr pointed out the following: it's not that it can't find the sections in Cattivik, it just doesn't see them at the top of the AST (somewhat reasonably, the wikitext is slightly malformed, and I doubt you find any parser other than Parsoid which tries to simulate the PHP parser's behavior for malformed wikitext).
(specifically, the PHP parser parses «''n'n rispond' senza il mio avvocat'''» as «<i>n'n rispond' senza il mio avvocat</i>'» while mwparserfromhell seems to parse it as «<i>n'n rispond' senza il mio avvocat<b>» and then the <i> tag ends up wrapping the next few sections.)
How big a problem is it? Here is a proposal for how to measure it (a minimal sketch follows the list):
- for N wikis, create a random set of X pages
- use mwparserfromhell to split each page in the set into sections, count the sections, and store the section count for each page
- for each page in the set, call https://en.wikipedia.org/api/rest_v1/page/mobile-sections/{page_title}, count the sections, and store the count
- compare the section counts from the two methods: on what percentage of pages do they disagree?
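A minimal sketch of this measurement in Python (illustrative function names, not existing pipeline code). It assumes the mobile-sections response lists the non-lead sections under remaining.sections, and that counting the headings mwparserfromhell finds is an acceptable proxy for its section count; the real pipeline would need to pin down one consistent counting convention (lead section, headings inside templates, etc.).

```python
from urllib.parse import quote

import mwparserfromhell
import requests

REST_URL = "https://{wiki}/api/rest_v1/page/mobile-sections/{title}"


def count_sections_mwparserfromhell(wikitext):
    # Each heading starts a section, so counting headings approximates the
    # section count produced by splitting the page with mwparserfromhell.
    return len(mwparserfromhell.parse(wikitext).filter_headings())


def count_sections_rest(wiki, title):
    # Assumed response shape: the non-lead sections sit under remaining.sections.
    url = REST_URL.format(wiki=wiki, title=quote(title.replace(" ", "_"), safe=""))
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return len(response.json().get("remaining", {}).get("sections", []))


def disagreement_ratio(pages):
    # pages: iterable of (wiki domain, page title, wikitext) triples from the sample.
    pages = list(pages)
    disagreements = sum(
        count_sections_mwparserfromhell(wikitext) != count_sections_rest(wiki, title)
        for wiki, title, wikitext in pages
    )
    return disagreements / len(pages)
```

Running disagreement_ratio over each wiki's sample yields the disagreement percentage asked for in the last step.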
Once we know how big the problem is, we can decide whether we need to fix it. If we do, here are some potential ways to go about it:
- @dr0ptp4kt pointed to Parsoid-enabled endpoints in https://en.wikipedia.org/api/rest_v1, such as /page/mobile-sections/{title}. This needs further investigation, especially to understand whether we can use them as a library: plain REST endpoints don't seem viable, as the extra HTTP calls would impact the data-processing execution time
- @Tgr suggested that we could get good results with mwparserfromhell by iterating through all nodes, noting when a second-level heading is found, and taking the wikitext between consecutive ones. This requires changing the section extraction logic, and may be a medium effort with a high gain (see the sketch after this list)
- Investigate other wikitext parsers available for Python
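A rough sketch of the second option above, assuming we only need the raw wikitext between consecutive level-2 headings (function name and return shape are illustrative):

```python
import mwparserfromhell
from mwparserfromhell.nodes import Heading


def split_on_level2_headings(wikitext):
    # Returns (heading title, section wikitext) pairs; the text before the
    # first level-2 heading (the lead) is returned with an empty title.
    wikicode = mwparserfromhell.parse(wikitext)
    sections = []
    title, buffer = "", []
    for node in wikicode.nodes:
        if isinstance(node, Heading) and node.level == 2:
            sections.append((title, "".join(buffer)))
            title, buffer = node.title.strip_code().strip(), []
        else:
            buffer.append(str(node))
    sections.append((title, "".join(buffer)))
    return sections
```

Note that this plain top-level walk would still miss headings swallowed inside a malformed tag node, as in the Cattivik example above, so a real fix would also need to recurse into nested nodes.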
Note that infobox blue links also sometimes get associated with the wrong section. See for instance the following algorithm output row:
```
+-------+------------------+--------------------+-------+--------------------+--------------------+----------+----+
|wiki_db|page_wikidata_item|          page_title|page_id|     section_heading|section_link_item_id|commons_id| pid|
+-------+------------------+--------------------+-------+--------------------+--------------------+----------+----+
| ptwiki|               Q51|           Antártida|    396|           toponímia|               Q3960|   8727360|p373|
+-------+------------------+--------------------+-------+--------------------+--------------------+----------+----+
```
This may be a related issue that requires further digging.
Reports
We implemented the proposal and compared mwparserfromhell with the Wikimedia REST and Action APIs.
Wikimedia REST API
Example call: https://en.wikipedia.org/api/rest_v1/page/mobile-sections/Joe_Strummer/1116897025
100 pages per wiki
500 pages per wiki
- sample 1:
- sample 2:
Action API
Example call: https://en.wikipedia.org/w/api.php?action=parse&oldid=1116897025&prop=sections&format=json
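For reference, a minimal sketch of how the Action API count could be obtained for this comparison, assuming the standard action=parse&prop=sections response where the section list (with toclevel values) sits under parse.sections:

```python
import requests


def count_sections_action(wiki, oldid):
    # Count the TOC entries the Action API reports for a given revision.
    params = {
        "action": "parse",
        "oldid": oldid,
        "prop": "sections",
        "format": "json",
        "formatversion": 2,
    }
    response = requests.get(f"https://{wiki}/w/api.php", params=params, timeout=30)
    response.raise_for_status()
    return len(response.json()["parse"]["sections"])
```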
100 pages per wiki
- sample A:
- sample B:
500 pages per wiki
Notes
We dug into the Wikimedia REST API 500-pages-per-wiki sample 1:
- total = 133,221 pages
- disagreement = 2,610 pages
- ratio = 0.019 (~2%)
- the API yielded 687 sections with no TOC level. This biases the check, so the ratio is an upper bound on bad parsing
- 1,498 pages have fewer sections with mwparserfromhell
- 1,112 pages have more sections with mwparserfromhell
- disagreement details: P37933
- 👽 in P37933 there are
- 45 pages from the 10 Wikipedias with the most article pages - source
- 69 from the top-20
- 95 from the top-30
REST vs. Action APIs
In sample 1 we found a subset of pages where the Wikimedia REST API disagrees with the Action API:
- disagreement = 5,261 pages
- ratio = 0.039 (~4%)
- disagreement details: P38790
This disagreement slightly decreases if we take into account sections with missing TOC levels:
- the Action API had all sections with TOC levels
- disagreement = 4,868 pages
- ratio = 0.0365 (~3.7%)
- disagreement details: P38795
Among 16 random pages in P38795 there are exotic cases where it's hard to tell which API is right:
- REST was wrong in 10 pages
- 🐴 Action was wrong in 6 pages, but really wrong in only 1. The 6 pages had no sections rendered with the expected <h2> HTML tags, yet the Action API correctly included them at TOC level 1. For instance:
Conclusion
Based on notes marked with an emoji, here are our conclusions:
- 👽 we suspect that most errors come from the long tail of small Wikipedias
- 🐴 exotic pages without standard section headers would break @Tgr 's solution
- 🐴 even if the Action API is more reliable than the REST one, 🐞 an average of 10% badly parsed pages is not acceptable per se
- 👽 however, if that 10% mainly comes from the long tail of exotic pages in small Wikipedias, then I think we should make a product decision, CC @AUgolnikova-WMF
- if we decide to fix this, we’ll have to figure out the most effective change to our section extraction logic. Finding the level-2 headers will still not work for those exotic pages, so it will be a matter of traversing the wikitext tree correctly
Next steps
According to P38061, 36 out of 305 Wikipedias have more than 10% badly parsed pages, while the rate is tolerable for the vast majority; see P40228.
We will pre-compute all pages with bad parsing and filter them out for problematic wikis, starting with ptwiki in T323489: Pre-compute bad section parsing for ptwiki.
The need for custom parsing solutions will be tackled if we have to increase the volume of section-level image suggestions.