Missing space between paragraphs in extract received using API (all wikis)
Open, Needs Triage | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

What happens?:
The response contains "fikce.Dne" (no space)

What should have happened instead?:
The response should read "fikce. Dne" (yes space 🙂)

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc.:
The source of the page is the following:

Seriál nese prvky [[Psychodrama|psychologického dramatu]], [[science fiction]] a nadpřirozené fikce.<ref name="theglobeandmail.com">{{cite web|url=http://www.theglobeandmail.com/arts/television/doyle-netflixs-mysterious-the-oa-is-mind-bending-astounding-drama/article33344922/|title=John Doyle: Netflix's mysterious ''The OA'' is mind-bending drama|work=theglobeandmail.com|accessdate=December 18, 2016}}</ref><ref>{{cite web|url=http://www.telegraph.co.uk/on-demand/0/could-oa-new-stranger-things-know-far-netflixs-mysterious-new/|title=Could ''The OA'' be the new ''Stranger Things''? All we know so far about Netflix's mysterious new show|work=telegraph.co.uk|accessdate=December 19, 2016}}</ref> 

Dne 8. února 2017

Event Timeline

Hi everyone, any news on this one please?

@Soustruh: In general, in any Phabricator task, all news available can be found in the task. Thus no, no news.

This happens across all wikis, not just cswiki.

If you drop the "explaintext=1" parameter you can see a different version of the extract with some HTML markup between "fikce." and "Dne".

https://cs.wikipedia.org/wiki/Speci%C3%A1ln%C3%AD:API_p%C3%ADskovi%C5%A1t%C4%9B#action=query&format=json&prop=extracts&titles=OA&redirects=1&utf8=1&formatversion=2&exintro=1

The extract now includes "fikce.</p><p>Dne".
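For anyone who wants to compare the two responses side by side, here is a minimal sketch that builds both request URLs from the parameters in the sandbox link above. The endpoint `https://cs.wikipedia.org/w/api.php` and the helper name `build_extract_url` are assumptions added for illustration; they are not taken from the task.

```python
from urllib.parse import urlencode

# Assumption: the standard MediaWiki API endpoint for Czech Wikipedia.
API_ENDPOINT = "https://cs.wikipedia.org/w/api.php"


def build_extract_url(title, plain_text):
    """Build a prop=extracts query URL (hypothetical helper).

    plain_text=True adds explaintext=1; plain_text=False leaves it out,
    which makes the API return the HTML version of the extract.
    """
    # Same parameters as the API sandbox link in this task.
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "titles": title,
        "redirects": 1,
        "utf8": 1,
        "formatversion": 2,
        "exintro": 1,
    }
    if plain_text:
        params["explaintext"] = 1
    return API_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("OA", plain_text=True))
    print(build_extract_url("OA", plain_text=False))
```

Fetching both URLs and comparing the `extract` fields should show where the plain-text version loses the whitespace.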

Another random example from enwiki, with explaintext=1 and without: the extract with explaintext=1 contains "programmingCaso", while the extract without it contains "programming</dt></dl><ul><li>Caso". An obvious pattern emerges.

Whatever logic is removing whitespace between HTML tags is too aggressive for the plain-text version of the extract, or it needs to be smarter about tags and whitespace: <i>, <b>, and <em> tags should not correspond to spaces, but <p>, <ul>, <li> and many others should.

As a workaround, you can use the HTML extract and do your own HTML-to-plain conversion, though that is also fraught with potential problems.
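A rough sketch of that workaround, using only the Python standard library. It applies the rule described above: block-level tags become whitespace, inline tags become nothing. The `BLOCK_TAGS` set and the name `html_to_plain` are assumptions made up for this example, the tag list is certainly incomplete, and this is not how the extension itself does the conversion.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked, incomplete set of tags that should act as
# word separators. Inline tags such as <i>, <b>, <em> are left out on purpose.
BLOCK_TAGS = {
    "p", "div", "br", "ul", "ol", "li", "dl", "dt", "dd",
    "h1", "h2", "h3", "h4", "h5", "h6",
    "table", "tr", "td", "th", "blockquote",
}


class _TextExtractor(HTMLParser):
    """Collects text nodes, emitting a space at block-level tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_plain(html):
    """Convert an HTML extract to plain text (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse every run of whitespace into a single space.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_plain("fikce.</p><p>Dne"))
    print(html_to_plain("programming</dt></dl><ul><li>Caso"))
```

Note that this flattens paragraph breaks to a single space rather than a newline, and it does nothing special for hidden elements, `<script>` or `<style>` content, which is part of why the do-it-yourself route has its own pitfalls.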

TJones renamed this task from Missing space between paragraphs in extract received using API (cswiki) to Missing space between paragraphs in extract received using API (all wikis). Dec 5 2023, 9:52 PM