Page MenuHomePhabricator

HTML version of text extracts is not balanced/well formed and naive
Closed, DeclinedPublic

Description

  • Currently, an HTML extract looks something like this:

<p>1\n</p>\n\n<p>When writing systems were created in ancient civilizations, a variety of objects, such\n<!-- Tidy found serious XHTML errors -->...

Notice how the second <p> is not closed, and how we're shipping extra debug information. Make sure the new config actually fixes the invalid code and doesn't output any debug information.

The issues came up while we were working on T156467.

  • Using number of characters by string slicing can have unexpected consequences and we may want to revisit how we do that. See T92628

Event Timeline

Jdlrobson renamed this task from Improve HTML version of text extracts to HTML version of text extracts is not balanced.May 25 2017, 4:08 PM
Jdlrobson moved this task from Incoming to Needs Prioritization on the Web-Team-Backlog board.
Jdlrobson renamed this task from HTML version of text extracts is not balanced to HTML version of text extracts is not balanced/well formed.May 25 2017, 4:11 PM
Jdlrobson subscribed.

Instead of using wgTidy we could be less naive when constructing the html extract.. E.g. Iterate through dom elements. Using wgtidy to me feels like a last resort.

Jdlrobson changed the task status from Open to Stalled.May 26 2017, 3:44 PM

On discussing and speccing out the epic

Jdlrobson renamed this task from HTML version of text extracts is not balanced/well formed to HTML version of text extracts is not balanced/well formed and naive.May 26 2017, 6:32 PM
Jdlrobson updated the task description. (Show Details)
phuedx changed the task status from Stalled to Open.Jun 20 2017, 4:48 PM
phuedx subscribed.

I'm un-stalling this. Both TextExtracts and the RESTBase page summary endpoint should return well-formed HTML extracts in all cases.

@phuedx it was stalled because we hadn't decided how to do this not if we should do this. In TextExtracts or in RESTBASE? wgTidy or DOM iteration.

@phuedx it was stalled because we hadn't decided how to do this not if we should do this. In TextExtracts or in RESTBASE? wgTidy or DOM iteration.

TextExtracts as this is where the buggy behaviour is.

Let's see if Tidy can actually fix these errors first – if nothing else, we shouldn't be relying on deprecated APIs. If not, then we'll have to move to DOM iteration. Let's loop in someone from Parsing-Team--ARCHIVED?

Hopefully this can be fixed by T168329. Right now sentences are determined by a regex. Using HTMLFormatter will make the end result much tidier.

Jdlrobson triaged this task as Medium priority.Jun 22 2017, 6:23 PM
Jdlrobson changed the task status from Open to Stalled.Jul 6 2017, 7:20 PM

Stalled until we have come up with a clear definition in T113094

@phuedx and I have made the decision that this will not be fixed. T170617 will make sure that api consumers know about this problem. For those who want well formed HTML we will be providing a new service on RESTBase which will guarantee that (please follow along with T113094)

For those who want well formed HTML we will be providing a new service on RESTBase which will guarantee that (please follow along with T113094)

That thread is a bit confusing on this particular point. Would it be correct to assume that T165017 is the task related to providing well-formed HTML in the summary extracts? In which case, it being resolved then means that the functionality is now available and ready for consumption by third-parties?

Hi @waldyrious sorry for the confusion. The API will be hosted here:
https://en.wikipedia.org/api/rest_v1/page/summary/Spain

We are currently in the process of switching out the old version to the new version which will have well formed HTML. Please track T177431 to keep up to date with updates there!