TextExtracts returns "..." on some pages but not others
Open, Lowest, Public

Assigned To

None

Authored By

	JeffreyWang
	Dec 11 2020, 8:42 PM

Description

On the main page of one of our wikis (https://learn.winona.edu/wiki/Main_Page), we are getting inconsistent responses from TextExtracts, and this is preventing Popups from behaving correctly. Sometimes it returns "..."; other times it returns the expected text extract from the page.

At the time of writing, this link returns "...":

https://learn.winona.edu/w/api.php?action=query&format=json&prop=info%7Cextracts%7Cpageimages%7Crevisions%7Cinfo&formatversion=2&redirects=true&exintro=true&exchars=525&explaintext=true&piprop=thumbnail&pithumbsize=320&pilicense=any&rvprop=timestamp&inprop=url&titles=Keep_Teaching_Manual&smaxage=0&maxage=0&uselang=content

{"batchcomplete":true,"query":{"normalized":[{"fromencoded":false,"from":"Keep_Teaching_Manual","to":"Keep Teaching Manual"}],"pages":[{"pageid":9367,"ns":0,"title":"Keep Teaching Manual","contentmodel":"wikitext","pagelanguage":"en","pagelanguagehtmlcode":"en","pagelanguagedir":"ltr","touched":"2020-12-07T13:46:22Z","lastrevid":69208,"length":3020,"fullurl":"https://learn.winona.edu/wiki/Keep_Teaching_Manual","editurl":"https://learn.winona.edu/w/index.php?title=Keep_Teaching_Manual&action=edit","canonicalurl":"https://learn.winona.edu/wiki/Keep_Teaching_Manual","extract":"...","revisions":[{"timestamp":"2020-08-17T04:49:25Z"}]}]}}

This link, on the other hand, returns the full text extract you'd expect:

https://learn.winona.edu/w/api.php?action=query&format=json&prop=info%7Cextracts%7Cpageimages%7Crevisions%7Cinfo&formatversion=2&redirects=true&exintro=true&exchars=525&explaintext=true&piprop=thumbnail&pithumbsize=320&pilicense=any&rvprop=timestamp&inprop=url&titles=Activate_your_Minnesota_State_Zoom_Pro_host_account&smaxage=300&maxage=300&uselang=content

{"batchcomplete":true,"query":{"normalized":[{"fromencoded":false,"from":"Activate_your_Minnesota_State_Zoom_Pro_host_account","to":"Activate your Minnesota State Zoom Pro host account"}],"pages":[{"pageid":9469,"ns":0,"title":"Activate your Minnesota State Zoom Pro host account","contentmodel":"wikitext","pagelanguage":"en","pagelanguagehtmlcode":"en","pagelanguagedir":"ltr","touched":"2020-12-10T15:26:17Z","lastrevid":70257,"length":3041,"fullurl":"https://learn.winona.edu/wiki/Activate_your_Minnesota_State_Zoom_Pro_host_account","editurl":"https://learn.winona.edu/w/index.php?title=Activate_your_Minnesota_State_Zoom_Pro_host_account&action=edit","canonicalurl":"https://learn.winona.edu/wiki/Activate_your_Minnesota_State_Zoom_Pro_host_account","extract":"Activate your Minnesota State Zoom Pro host account to host your own online meetings using the Pro version of Zoom. Activating your free Pro account provided by Minnesota State will give you access to all the features and functions of the Pro version of Zoom for as long as you have an active Minnesota State StarID username and password....","revisions":[{"timestamp":"2020-12-03T06:02:33Z"}]}]}}

This problem seems to be nondeterministic. Sometimes all of the API queries work, sometimes none of them work, and sometimes only some of them do.
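To make the symptom easier to spot across many pages, here is a minimal Python sketch that scans a saved API response for blank extracts. The function name `blank_extract_titles` is hypothetical. It assumes the `formatversion=2` response shape shown above, where `query.pages` is a list.

```python
import json


def blank_extract_titles(response_text):
    """Return titles of pages whose extract is empty or just '...'."""
    data = json.loads(response_text)
    # With formatversion=2, "pages" is a list of page objects.
    pages = data.get("query", {}).get("pages", [])
    return [
        page["title"]
        for page in pages
        if page.get("extract", "").strip() in ("", "...")
    ]
```

Running it on responses saved at different times should show whether the same titles fail consistently or whether the failures move around.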

We are using MediaWiki 1.35.0, PHP 7.4.13, MariaDB 10.1.47. https://learn.winona.edu/wiki/Special:Version

Details

	Subject	Repo	Branch	Lines +/-
	Skip <h2> in TOC when extracting first section	mediawiki/extensions/TextExtracts	master	+13 -2
	Prevent TOC heading from being captured in regex and returning blank extract	mediawiki/extensions/TextExtracts	master	+1 -1


Related Objects

Mentioned Here: T256505: TextExtracts extension: Code stewardship review

Event Timeline

JeffreyWang created this task. Dec 11 2020, 8:42 PM

Restricted Application added a subscriber: Aklapper. Dec 11 2020, 8:42 PM

FYI: T256505: TextExtracts extension: Code stewardship review

Jdlrobson added a project: Web-Team-Backlog (Tracking). Dec 11 2020, 9:05 PM

Remove the exintro parameter and it works well. The parameter must be omitted entirely to actually be false: if it's set to true, false, or anything else, it is taken as true. That is why the way $wgPopupsTextExtractsIntroOnly is implemented doesn't make sense to me.
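As a rough illustration of the point above, this Python sketch builds the query string so that exintro is either present or left out completely, and never sent as "false". The helper name `build_extract_query` and its defaults are hypothetical. Only parameter names already used in the URLs in this task appear.

```python
from urllib.parse import urlencode


def build_extract_query(title, intro_only, chars=525):
    """Build an api.php query string for a plain-text extract."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "extracts",
        "exchars": str(chars),
        "explaintext": "true",
        "titles": title,
    }
    # Per the comment above, any value (even "false") counts as true,
    # so the only way to turn exintro off is to leave the key out.
    if intro_only:
        params["exintro"] = "true"
    return urlencode(params)
```

For example, `build_extract_query("Keep_Teaching_Manual", intro_only=False)` yields a query string with no exintro key at all.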

Interesting, thanks @Porplemontage for the insight. Notwithstanding the confusing implementation of TextExtracts, I removed the line exintro: true from Popups/src/gateway/mediawiki.js. Since I do not know how to compile the frontend JS into the resources/dist/index.js file, I manually removed it from there as well, and that fixed this issue. The only remaining issue is that the extract won't stop at the intro (for example, this API call includes the headers and so on: https://learn.winona.edu/w/api.php?action=query&format=json&prop=info%7Cextracts%7Cpageimages%7Crevisions%7Cinfo&formatversion=2&redirects=true&exchars=525&explaintext=true&piprop=thumbnail&pithumbsize=320&pilicense=any&rvprop=timestamp&inprop=url&titles=Join_a_Zoom_meeting_as_a_participant&smaxage=300&maxage=300&uselang=content), but that's better than a nondeterministic "..." and should satisfy our client for now. Hopefully someone can do some more detailed debugging and find out why this happens.

Ah, not a perfect solution then. I figured out the root cause of the issue (at least in my case). I was including __TOC__ before the body text (floated to the right), and the TOC includes an h2 header. The regex in the getFirstSection function of TextExtracts' ApiQueryExtracts.php treats the h2 in the TOC as the start of the article's first section and returns all content before it, which is nothing (plaintext is false, so it's looking for <h?> tags). It doesn't matter that the TOC is a div or wrapped in a div, since that markup is only stripped out after the regex is run.

This is only a problem when getting the full text from the parser cache and then finding the first section with the regex. If you comment out the full-page tries, it will send a query with section=0 and get the real top section, and the TOC doesn't cause an issue. My current solution is to change this line:

$regexp = '/^(.*?)(?=<h[1-6]\b)/s';

to:

$regexp = '/^(.*?)(?=<h[1-6]>)/s';

I had to restart memcache afterwards to see the changes. This works because the TOC h2 tag always has a space after the tag name, while the main body h2 tags always close right away (<h2>). The skin, for example, doesn't affect the API output, does it? I would like to see this in the extension itself so I don't have to keep hacking it, provided we don't think it would cause any issues (and there are certainly safer ways of doing it, such as stripping the TOC h2 from the string before the regex is run).
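The difference between the two patterns can be reproduced outside MediaWiki. The sketch below is a Python re-creation of the PHP regexes quoted above, not the extension's code. SAMPLE_HTML is a simplified, made-up stand-in for parser output with a TOC placed before the intro, and `first_section` is a hypothetical helper. The tag stripping that TextExtracts does afterwards is not reproduced here.

```python
import re

# Python equivalents of the two PHP patterns (the /s flag maps to re.S).
ORIGINAL = re.compile(r'^(.*?)(?=<h[1-6]\b)', re.S)
PATCHED = re.compile(r'^(.*?)(?=<h[1-6]>)', re.S)

# Simplified, assumed HTML: the TOC heading carries an attribute,
# while the body heading is a bare <h2>.
SAMPLE_HTML = (
    '<div id="toc" class="toc"><h2 id="mw-toc-heading">Contents</h2></div>'
    '<p>Intro paragraph.</p>'
    '<h2><span class="mw-headline" id="Setup">Setup</span></h2>'
    '<p>Section body.</p>'
)


def first_section(html, pattern):
    """Return everything before the first heading the pattern recognizes."""
    match = pattern.search(html)
    return match.group(1) if match else html


# The original pattern stops at the TOC heading, so the intro is lost.
before_toc = first_section(SAMPLE_HTML, ORIGINAL)
# The patched pattern skips the TOC heading and stops at the bare <h2>.
before_body_heading = first_section(SAMPLE_HTML, PATCHED)
```

With this sample, `before_toc` holds only the opening TOC div and no intro text, while `before_body_heading` includes the intro paragraph. The same experiment shows the fragility of the workaround: any body heading that carries an attribute would be skipped as well.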

Change 655159 had a related patch set uploaded (by Krinkle; owner: Porplemontage):
[mediawiki/extensions/TextExtracts@master] Prevent TOC heading from being captured in regex and returning blank extract

https://gerrit.wikimedia.org/r/655159

gerritbot added a project: Patch-For-Review. Jan 11 2021, 6:52 PM

Change 697418 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/TextExtracts@master] Skip <h2> in TOC when extracting first section

https://gerrit.wikimedia.org/r/697418

Jdlrobson moved this task from Untriaged to Untag on the Web-Team-Backlog (Tracking) board. Sep 8 2021, 3:26 PM

Jdlrobson removed a project: Web-Team-Backlog (Tracking). Dec 7 2021, 7:48 PM

Change 697418 merged by jenkins-bot: