Strip <br> tags from extracts
Open, LowPublic

Description

mobile view

On https://ja.wikipedia.org/wiki/%E6%B0%B4%E4%B8%AD%E3%80%81%E3%81%9D%E3%82%8C%E3%81%AF%E8%8B%A6%E3%81%97%E3%81%84 there's a <br clear="both" /> in the article.

That br tag shows up in the extract as well, causing some issues for downstream users (example: https://musicbrainz.org/artist/6fb627d9-983e-43c5-bf73-efcf8e81926b).

There's also extra whitespace in the mobile view (see attachment), which I think uses related code.

MusicBrainz bug report is http://tickets.musicbrainz.org/browse/MBS-7948


Version: unspecified
Severity: normal

Attached:

Replication steps

On http://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exchars=100000000&titles=Test%20br%20tags%20in%20extracts
The <br> tag shows up, which is expected since HTML is requested, but this leads to a random empty space.

With the explaintext flag set, it doesn't show:
http://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&prop=extracts&format=json&exchars=100000000&explaintext=&titles=Test%20br%20tags%20in%20extracts

We would like to rethink this behaviour.
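
The two sandbox queries above differ only in the explaintext flag. As a rough sketch, the equivalent request URLs could be built like this in Python. The parameter names come from the sandbox links; the `/w/api.php` endpoint path and the helper name `build_extract_url` are assumptions made for illustration, not something stated in this task.

```python
from urllib.parse import urlencode

# Assumed endpoint: the task only links to Special:ApiSandbox, so the
# /w/api.php path here is a guess at the usual MediaWiki layout.
API_ENDPOINT = "http://en.wikipedia.beta.wmflabs.org/w/api.php"


def build_extract_url(title, plain_text=False, endpoint=API_ENDPOINT):
    """Build a prop=extracts query URL (hypothetical helper).

    With plain_text=False the extract is requested as HTML, which is the
    case where the <br> tag shows up. With plain_text=True the empty
    explaintext flag is added, matching the second sandbox link.
    """
    params = {
        "action": "query",
        "prop": "extracts",
        "format": "json",
        "exchars": "100000000",
        "titles": title,
    }
    if plain_text:
        # The flag carries no value; its presence is what matters.
        params["explaintext"] = ""
    return endpoint + "?" + urlencode(params)


html_url = build_extract_url("Test br tags in extracts")
text_url = build_extract_url("Test br tags in extracts", plain_text=True)
```

Fetching `html_url` and `text_url` and comparing the two extracts should reproduce the difference described in the steps.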

AC

Details

Reference
bz72546

Event Timeline

bzimport raised the priority of this task from to Needs Triage.
bzimport set Reference to bz72546.
bzimport added a subscriber: Unknown Object (MLST).
Legoktm created this task. Oct 27 2014, 1:51 AM

Is this still an issue? The example no longer works for me :(

Restricted Application added a subscriber: Aklapper. Sep 18 2015, 6:37 PM

The br tag was removed at my request to work around this issue: https://ja.wikipedia.org/w/index.php?title=%E6%B0%B4%E4%B8%AD%E3%80%81%E3%81%9D%E3%82%8C%E3%81%AF%E8%8B%A6%E3%81%97%E3%81%84&type=revision&diff=53346360&oldid=53061468

You could use that old revision to test if the bug still exists.

Thanks @Legoktm that's super helpful. Will take a look Monday.

Legoktm reopened this task as Open. Nov 1 2015, 2:48 AM

Yes, that's what this bug is about. There's no good reason for a <br clear="both" /> in the extract.

Well, you've asked for HTML, so it makes sense to include it. I can understand that in some applications this might not be useful; in others, however, it may be. Maybe what is actually needed here is an additional API parameter which explicitly asks for an extract without any unnecessary* formatting?

* We'd have to define what this means.
Jdlrobson triaged this task as Low priority. Nov 2 2015, 8:08 PM
phuedx added a subscriber: phuedx. Jun 20 2017, 4:53 PM

If we need a decision maker, I can be that person: I think this makes sense. I could email mediawiki-api-announce and see if there are cases when this behaviour wouldn't be desirable?

Sounds good to me, @phuedx. That would be a great first step.

MaxSem removed a subscriber: MaxSem. Jun 20 2017, 6:43 PM
MaxSem added a subscriber: MaxSem.
MaxSem removed a subscriber: MaxSem.
phuedx renamed this task from Strip <br> tags from extracts? to Strip <br> tags from extracts. Jun 22 2017, 10:00 AM

Here's the archived thread: https://lists.wikimedia.org/pipermail/mediawiki-api/2017-June/004001.html

Let's give people a week to respond to the email. If there are no problems, we'll go ahead and do this.

Given the lack of responses, I guess we should go ahead with this?
Are there any cases where removing the br tag may be problematic?

e.g. poetry?

'Er petticoat was yaller an' 'er little cap was green,
An' 'er name was Supi-yaw-lat - jes' the same as Theebaw's Queen, 
An' I seed her first a-smokin' of a whackin' white cheroot,
An' a-wastin' Christian kisses on an 'eathen idol's foot:
Bloomin' idol made o' mud 
Wot they called the Great Gawd Budd
Plucky lot she cared for idols when I kissed 'er where she stud!
On the road to Mandalay...

would become

'Er petticoat was yaller an' 'er little cap was green,An' 'er name was Supi-yaw-lat - jes' the same as Theebaw's Queen,An' I seed her first a-smokin' of a whackin' white cheroot,An' a-wastin' Christian kisses on an 'eathen idol's foot:Bloomin' idol made o' mudWot they called the Great Gawd BuddPlucky lot she cared for idols when I kissed 'er where she stud!On the road to Mandalay...

We probably want to replace it with a space rather than strip it, to avoid lines joining.

@phuedx what do you think?
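
To make the difference concrete, here is a minimal Python sketch comparing the two options on a line from the poem. It is only an illustration: the regular expression and the helper names `strip_br` and `replace_br_with_space` are assumptions, not the code the extracts API actually uses, and a real fix would more likely work on the parsed HTML than on a regex.

```python
import re

# Matches <br>, <br/>, <br /> and <br clear="both" /> in any letter case.
# The \b keeps it from matching other tags that merely start with "br".
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def strip_br(html):
    """Remove <br> tags outright; adjacent lines end up joined."""
    return BR_TAG.sub("", html)


def replace_br_with_space(html):
    """Swap each <br> tag for a single space so lines stay separated."""
    return BR_TAG.sub(" ", html)


sample = "'er little cap was green,<br />An' 'er name was Supi-yaw-lat"

joined = strip_br(sample)
# joined == "'er little cap was green,An' 'er name was Supi-yaw-lat"

spaced = replace_br_with_space(sample)
# spaced == "'er little cap was green, An' 'er name was Supi-yaw-lat"
```

Stripping reproduces the run-together text shown above, while the space keeps the words apart at the cost of losing the line break itself.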

Given lack of responses I guess we should go ahead with this?

👍

We probably want to replace it with a space rather than strip to avoid lines joining.

👍

@phuedx should we do this? It seems like quite a trivial fix, so I'm open to fixing it, but I'm also wary that we'd be adding more complexity to an API with many, many problems. An alternative approach would be to add a warning as part of T170617.

Jdlrobson raised the priority of this task from Low to Normal. Jul 13 2017, 6:51 PM

Yeah! It'll make HTML extracts easier to consume – arguably – and there's been no pushback about it on the mailing list.

phuedx updated the task description. Jul 14 2017, 1:13 PM
Jdlrobson lowered the priority of this task from Normal to Low.