HTML version of text extracts is not balanced/well formed and naive
Closed, Declined · Public

Assigned To

None

Authored By

	• bmansurov
	May 24 2017, 10:11 PM

Description

Currently, an HTML extract looks something like this:

<p>1\n</p>\n\n<p>When writing systems were created in ancient civilizations, a variety of objects, such\n...

Notice how the second <p> is not closed, and how we're shipping extra debug information. Make sure the new config actually fixes the invalid code and doesn't output any debug information.
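A quick way to check whether a given extract is balanced is to walk its tags with a stack. The following is only a sketch in Python (not the PHP that TextExtracts is written in); the helper name `unclosed_tags` is made up for illustration, and the check is deliberately strict: it treats every non-void element as needing an explicit closing tag, even where HTML would allow omitting it.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class _BalanceChecker(HTMLParser):
    """Tracks open elements on a stack while the fragment is parsed."""

    def __init__(self):
        super().__init__()
        self.stack = []   # tags opened but not yet closed
        self.stray = []   # closing tags that matched nothing

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> opens nothing.
        pass

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.stray.append(tag)


def unclosed_tags(fragment):
    """Return the tags still open at the end of the fragment (hypothetical helper)."""
    checker = _BalanceChecker()
    checker.feed(fragment)
    checker.close()
    return checker.stack


sample = "<p>1\n</p>\n\n<p>When writing systems were created"
print(unclosed_tags(sample))  # ['p']
```

An empty list means every opened element was closed; for the extract above, the second `<p>` is reported as still open.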

The issues came up while we were working on T156467.

Truncating the extract to a number of characters by string slicing can have unexpected consequences, and we may want to revisit how we do that. See T92628.
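One general way a slice can go wrong is when the cut lands in the middle of a multi-byte UTF-8 sequence, which then decodes to the replacement character U+FFFD. The Python sketch below illustrates that mechanism only; it is an assumption that something similar is involved in T92628, and the function names are hypothetical.

```python
def cut_bytes_naive(text, limit):
    """Slice the UTF-8 bytes, then decode; a split character becomes U+FFFD."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="replace")


def cut_bytes_safe(text, limit):
    """Slice the UTF-8 bytes, but drop an incomplete trailing sequence."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def cut_chars(text, limit):
    """Slice by code points, which never splits an encoded character."""
    return text[:limit]


word = "na\u00efve"  # "naive" with a diaeresis; the i is two bytes in UTF-8
print(repr(cut_bytes_naive(word, 3)))  # 'na\ufffd'
print(repr(cut_bytes_safe(word, 3)))   # 'na'
print(repr(cut_chars(word, 3)))        # 'na\xef'
```

Slicing by code points avoids the broken byte sequence, but it still knows nothing about markup, so it can cut inside a tag or leave elements open.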

$wgUseTidy used in Extension:TextExtracts is deprecated. Use $wgTidyConfig instead.
This was moved to T168671: Update usage of deprecated wgUseTidy.

Related Objects

		Status	Subtype	Assigned	Task
		Declined		None	T166272 HTML version of text extracts is not balanced/well formed and naive
		Open		None	T92628 Unicode replacement character \ufffd appear in text extract if exchar cuts at section boundary

Event Timeline

bmansurov created this task. May 24 2017, 10:11 PM

Restricted Application added a subscriber: Aklapper. May 24 2017, 10:11 PM

Jdlrobson renamed this task from "Improve HTML version of text extracts" to "HTML version of text extracts is not balanced". May 25 2017, 4:08 PM

Jdlrobson moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.

Jdlrobson added a parent task: T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles.

Instead of using wgTidy, we could be less naive when constructing the HTML extract, e.g. by iterating through DOM elements. Using wgTidy feels like a last resort to me.
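As a rough illustration of the "iterate through elements" idea: walk the markup token by token, count only text characters against the limit, and close whatever is still open once the limit is reached. This is a Python sketch under those assumptions, not a proposal for the actual PHP implementation; `truncate_html` is a hypothetical name, and attributes are dropped to keep the example short.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class _Truncator(HTMLParser):
    """Copies tags and text until the text budget is spent."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit   # text characters still allowed
        self.out = []            # output pieces
        self.open_tags = []      # elements opened and not yet closed
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(f"<{tag}>")  # attributes are dropped in this sketch
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.done:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.done:
            return
        piece = data[: self.remaining]
        self.out.append(escape(piece, quote=False))
        self.remaining -= len(piece)
        if self.remaining <= 0:
            self.done = True

    def result(self):
        # Close anything still open, innermost element first.
        closing = "".join(f"</{tag}>" for tag in reversed(self.open_tags))
        return "".join(self.out) + closing


def truncate_html(fragment, limit):
    """Keep at most `limit` text characters and return balanced markup."""
    truncator = _Truncator(limit)
    truncator.feed(fragment)
    truncator.close()
    return truncator.result()


print(truncate_html("<p>Hello <b>brave new</b> world</p>", 9))
# <p>Hello <b>bra</b></p>
```

Because the cut is made on text nodes rather than on the serialized string, the result stays balanced without a clean-up pass afterwards.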

waldyrious awarded a token. May 26 2017, 12:35 AM

waldyrious subscribed.

On discussing and speccing out the epic

Jdlrobson renamed this task from "HTML version of text extracts is not balanced/well formed" to "HTML version of text extracts is not balanced/well formed and naive". May 26 2017, 6:32 PM

Jdlrobson added a subtask: T92628: Unicode replacement character \ufffd appear in text extract if exchar cuts at section boundary.

Jdlrobson updated the task description.

I'm un-stalling this. Both TextExtracts and the RESTBase page summary endpoint should return well-formed HTML extracts in all cases.

@phuedx it was stalled because we hadn't decided how to do this, not whether we should do it. In TextExtracts or in RESTBase? wgTidy or DOM iteration?

In T166272#3364566, @Jdlrobson wrote:

@phuedx it was stalled because we hadn't decided how to do this, not whether we should do it. In TextExtracts or in RESTBase? wgTidy or DOM iteration?

TextExtracts as this is where the buggy behaviour is.

Let's see if Tidy can actually fix these errors first – if nothing else, we shouldn't be relying on deprecated APIs. If not, then we'll have to move to DOM iteration. Let's loop in someone from Parsing-Team--ARCHIVED?

ovasileva moved this task from Needs Prioritization to Incoming on the Web-Team-Backlog board. Jun 21 2017, 1:43 PM

Jdlrobson moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board. Jun 21 2017, 4:09 PM

phuedx mentioned this in T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles. Jun 22 2017, 1:17 PM

Jdlrobson moved this task from Needs Prioritization to Triaged but Future on the Web-Team-Backlog board. Jun 22 2017, 4:44 PM

ovasileva moved this task from Triaged but Future to Upcoming on the Web-Team-Backlog board. Jun 22 2017, 5:35 PM

Hopefully this can be fixed by T168329. Right now sentences are determined by a regex. Using HTMLFormatter will make the end result much tidier.
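To see why a regex-based sentence cut is fragile on HTML, consider the deliberately naive pattern below. It is a made-up pattern for illustration only, not the expression TextExtracts uses, and the example is in Python rather than PHP.

```python
import re

# Hypothetical pattern: everything up to and including the first full stop.
FIRST_SENTENCE = re.compile(r"^.*?\.", re.DOTALL)


def first_sentence_naive(html):
    """Return the text up to the first full stop, ignoring markup entirely."""
    match = FIRST_SENTENCE.match(html)
    return match.group(0) if match else html


sample = "<p>A pen is a <b>tool.</b> It writes.</p>"
print(first_sentence_naive(sample))  # <p>A pen is a <b>tool.
```

The cut lands before `</b>` and `</p>`, so both elements are left open. A formatter that works on the parsed document rather than on the raw string would not have this problem.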

Jdlrobson moved this task from Upcoming to Needs Prioritization on the Web-Team-Backlog board. Jun 22 2017, 5:39 PM

phuedx mentioned this in T168671: Update usage of deprecated wgUseTidy. Jun 22 2017, 5:43 PM

phuedx updated the task description.

Jdlrobson triaged this task as Medium priority. Jun 22 2017, 6:23 PM

phuedx mentioned this in T168329: exsentences does not work correctly when HTML output used. Jun 22 2017, 6:26 PM

Stalled until we have come up with a clear definition in T113094

@phuedx and I have decided that this will not be fixed. T170617 will make sure that API consumers know about this problem. For those who want well-formed HTML, we will be providing a new service on RESTBase which will guarantee that (please follow along with T113094).

In T166272#3437125, @Jdlrobson wrote:

For those who want well formed HTML we will be providing a new service on RESTBase which will guarantee that (please follow along with T113094)

That thread is a bit confusing on this particular point. Would it be correct to assume that T165017 is the task related to providing well-formed HTML in the summary extracts? If so, does its being resolved mean that the functionality is now available and ready for consumption by third parties?

Hi @waldyrious, sorry for the confusion. The API will be hosted here:
https://en.wikipedia.org/api/rest_v1/page/summary/Spain

We are currently in the process of switching from the old version to the new version, which will have well-formed HTML. Please track T177431 to keep up to date with updates there!

Jhernandez removed a parent task: T113094: [EPIC] The Page Summary API needs to provide useful content for the majority of articles. Sep 11 2019, 12:23 PM
