Change Details

== Background

In {T113094}, many issues were identified with previews not rendering appropriate text for articles. The majority of these issues are identified below. Any issues not identified here will be pushed to a future iteration.

== Acceptance criteria

**Mathematical expressions**: All mathematical expressions must be rendered as they appear in the original article, including subscripts and formulae (https://phabricator.wikimedia.org/T141766 and https://phabricator.wikimedia.org/T112137).

**Bolding**: Bolding will appear as it does within the article (https://phabricator.wikimedia.org/T141651).

**Parentheticals**: Parentheticals will be stripped (https://phabricator.wikimedia.org/T91344). Note: there are edge cases where this does not make sense; we will look at those separately in the second iteration.

**<noinclude>**: Previews must not display <noinclude> content (more info here: https://phabricator.wikimedia.org/T109869).

**Lists**: If the article’s first paragraph contains a list, the list will be presented in the summary (bulleted lists as bullets, numbered lists as numbers). More info here: https://phabricator.wikimedia.org/T59850

Note well that for the above edge cases, the generic preview may be used if a more efficient solution cannot be identified.

* All tasks associated with a particular feature are closed once that feature is implemented.
** This should usually go without saying but let's take the time to comb the Page Previews and TextExtracts backlogs for duplicate tasks.

== Proposal

The proposal has the following goals:

* Keep the `extracts` API query module as generic as possible.
* Contain Page Previews knowledge within its extension (see above).
* Support third-party MediaWiki installations.
** Unless the PO/TL decide we're not going to do this and let folk

We create a query module within the Page Previews extension, `pp_extract`, which can produce an extract that satisfies the requirements above. It:

* Will defer to `ExtractFormatter`, provided by the TextExtracts extension, to do extract selection and content filtering.
** i.e. it'll filter out everything that doesn't match AC #1, #2, #4, and #5.
* Will not accept any parameters.
* Will only operate on one page – the Page Previews API request is only ever for one page.

Tying this into RestBase should be as trivial as changing `'extracts'` to `'pp_extract'` and removing the `ex`-prefix query parameters.

=== Plan (YMMV)

* Extract `ApiQueryExtracts#getFirstSection( $text, $isPlainText )` to `TextExtracts\Extractor#getFirstSection( $html )`.
** It might be worth extracting `ExtractFormatter::getFirstChars` and `::getFirstSentences` too…
* Create `PagePreviews\HtmlElementFilter`, which uses `ExtractFormatter` with the HTML element whitelist that satisfies the AC above.
* Create `PagePreviews\ParentheticalFilter`, which satisfies AC #3.
* Create `PagePreviews\ApiQueryPPExtract`, which ties the above into the API.

**TODO**: There might be a piece missing about where we get the HTML content from, `PagePreviews\ExtractSource` maybe?

== Background

In {T113094}, many issues were identified with previews not rendering appropriate text for articles. The majority of these issues are identified below. Any issues not identified here will be pushed to a future iteration.

== Acceptance criteria

**Mathematical expressions**: All mathematical expressions must be rendered as they appear in the original article, including subscripts and formulae (https://phabricator.wikimedia.org/T141766 and https://phabricator.wikimedia.org/T112137).

**Bolding**: Bolding will appear as it does within the article (https://phabricator.wikimedia.org/T141651).

**Parentheticals**: Parentheticals will be stripped (https://phabricator.wikimedia.org/T91344). Note: there are edge cases where this does not make sense; we will look at those separately in the second iteration.

**<noinclude>**: Previews must not display <noinclude> content (more info here: https://phabricator.wikimedia.org/T109869).

**Lists**: If the article’s first paragraph contains a list, the list will be presented in the summary (bulleted lists as bullets, numbered lists as numbers). More info here: https://phabricator.wikimedia.org/T59850, {T156369}

Note well that for the above edge cases, the generic preview may be used if a more efficient solution cannot be identified.

* All tasks associated with a particular feature are closed once that feature is implemented.
** This should usually go without saying but let's take the time to comb the Page Previews and TextExtracts backlogs for duplicate tasks.

== Proposal

The proposal has the following goals:

* Keep the `extracts` API query module as generic as possible.
* Contain Page Previews knowledge within its extension (see above).
* Support third-party MediaWiki installations.
** Unless the PO/TL decide we're not going to do this and let folk

We create a query module within the Page Previews extension, `pp_extract`, which can produce an extract that satisfies the requirements above. It:

* Will defer to `ExtractFormatter`, provided by the TextExtracts extension, to do extract selection and content filtering.
** i.e. it'll filter out everything that doesn't match AC #1, #2, #4, and #5.
* Will not accept any parameters.
* Will only operate on one page – the Page Previews API request is only ever for one page.

Tying this into RestBase should be as trivial as changing `'extracts'` to `'pp_extract'` and removing the `ex`-prefix query parameters.

=== Plan (YMMV)

* Extract `ApiQueryExtracts#getFirstSection( $text, $isPlainText )` to `TextExtracts\Extractor#getFirstSection( $html )`.
** It might be worth extracting `ExtractFormatter::getFirstChars` and `::getFirstSentences` too…
* Create `PagePreviews\HtmlElementFilter`, which uses `ExtractFormatter` with the HTML element whitelist that satisfies the AC above.
* Create `PagePreviews\ParentheticalFilter`, which satisfies AC #3.
* Create `PagePreviews\ApiQueryPPExtract`, which ties the above into the API.

**TODO**: There might be a piece missing about where we get the HTML content from, `PagePreviews\ExtractSource` maybe?
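
To make the parenthetical-stripping criterion concrete, here is a minimal sketch of what a filter along the lines of the proposed `PagePreviews\ParentheticalFilter` might do. It is written in Python purely for illustration (the extension itself would be PHP), it works on plain text rather than HTML, and the function name `strip_parentheticals` is hypothetical. It assumes "parenthetical" means any balanced, possibly nested, pair of round brackets.

```python
import re


def strip_parentheticals(text: str) -> str:
    """Remove balanced (possibly nested) round-bracket parentheticals.

    Hypothetical helper for illustration only; not part of TextExtracts
    or Page Previews.
    """
    kept = []
    depth = 0  # current nesting level of open parentheses
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            # Outside any parenthetical (this also keeps a stray ")").
            kept.append(ch)

    if depth != 0:
        # Unbalanced "(": leave the input untouched rather than
        # silently dropping the rest of the sentence.
        return text

    result = "".join(kept)
    # Tidy up the whitespace left behind by the removed span.
    result = re.sub(r"\s{2,}", " ", result)
    result = re.sub(r"\s+([,.;:])", r"\1", result)
    return result.strip()


if __name__ == "__main__":
    print(strip_parentheticals("Paris (French: [paʁi]) is the capital of France."))
```

Note that a naive filter like this would also turn `f(x) = 2` into `f = 2`, which is presumably the kind of edge case the acceptance criteria defer to the second iteration.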