API problems with double spaces in wiki sections
Closed, Invalid · Public

Description

When I use this API query, a list of the page's sections is returned:

https://nl.wikipedia.org/w/api.php?action=parse&prop=sections&page=Samenstelling_Tweede_Kamer_1875-1879&format=json

The following query returns the content of the page:

https://nl.wikipedia.org/w/api.php?explaintext=&rvprop=ids&titles=Samenstelling_Tweede_Kamer_1875-1879&prop=extracts|revisions&action=query&format=json

The problem is that one of the section titles returned by the first query contains two consecutive spaces:

Conservatief-Katholiek  (14 zetels)

The same section title in the body (returned by the second query) does not have this double space:

Conservatief-Katholiek (14 zetels)

I noticed this while using the wikipedia Python package. To make my script work, I removed the double space from the wiki page, so to reproduce the problem that edit has to be reverted.

I'm not sure if this is a bug or a feature.

Restricted Application added a subscriber: Aklapper. · Jan 22 2018, 11:27 AM

I saw the following in the documentation:

sections: Gives the sections in the parsed wikitext (TOC, table of contents).

and:

explaintext

Return extracts as plain text instead of limited HTML.
Type: boolean

exsectionformat

How to format sections in plaintext mode:

plain
    No formatting.
wiki
    Wikitext-style formatting (== like this ==).
raw
    This module's internal representation (section titles prefixed with <ASCII 1><ASCII 2><section level><ASCII 2><ASCII 1>).

One of the following values: plain, wiki, raw
Default: wiki
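
To illustrate the raw format described above, here is a minimal Python sketch that pulls section levels and titles out of an extract requested with exsectionformat=raw. It assumes the marker layout is exactly as the documentation states, that the section level is written as decimal digits, and that the title runs to the end of the line; the function name find_raw_headings and the sample string are hypothetical, not real API output.

```python
import re

# Marker layout per the quoted documentation:
# <ASCII 1><ASCII 2><section level><ASCII 2><ASCII 1>, followed by the title.
# Assumptions: the level is decimal digits and the title ends at the newline.
RAW_HEADING = re.compile(r"\x01\x02(\d+)\x02\x01([^\n]*)")


def find_raw_headings(extract):
    """Return (level, title) pairs found in a raw-format extract."""
    return [
        (int(level), title.strip())
        for level, title in RAW_HEADING.findall(extract)
    ]


# Synthetic sample for illustration only, not copied from a real response.
sample = (
    "Intro text.\n\n"
    "\x01\x022\x02\x01Conservatief-Katholiek (14 zetels)\n"
    "Body text."
)
headings = find_raw_headings(sample)
```

Working from the raw markers avoids guessing where headings start in the wiki-style output, but the titles are still the ones TextExtracts produces, so they can still differ from the prop=sections titles.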

It appears that with the explaintext parameter, section headings are returned in wiki formatting by default. Is there a way to get this formatting for the sections list as well? If not, this issue can be turned into a feature request.

Anomie added a subscriber: Anomie.

If you want the actual wikitext rather than the text as adjusted by TextExtracts, use rvprop=content instead.

If you want section headings from TextExtracts in plain text rather than having them back-converted to wikitext-like syntax, try exsectionformat=plain.
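
As a rough sketch of the two suggestions above, the request URLs could be assembled like this in Python. The helper build_query_url is hypothetical; only the endpoint, the page title, and the parameter names come from this thread, and no request is actually sent.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://nl.wikipedia.org/w/api.php"
PAGE_TITLE = "Samenstelling_Tweede_Kamer_1875-1879"


def build_query_url(**params):
    """Hypothetical helper: build an action=query URL for PAGE_TITLE."""
    query = {"action": "query", "format": "json", "titles": PAGE_TITLE}
    query.update(params)
    return API_ENDPOINT + "?" + urlencode(query)


# Option 1: the revision's wikitext instead of a TextExtracts extract.
wikitext_url = build_query_url(prop="revisions", rvprop="content")

# Option 2: a TextExtracts plain-text extract with unformatted section headings.
plain_extract_url = build_query_url(
    prop="extracts", explaintext="", exsectionformat="plain"
)
```

Either URL can then be fetched with any HTTP client; the first returns wikitext, the second returns the extract without the == like this == heading syntax.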

@Anomie Thank you, those parameters apply to the content queries. I was actually wondering whether there is a formatting option for prop=sections: something like exsectionformat=wiki, but for the sections query (the first one in my first message). It looks like it's using plain by default.

There isn't, and since the data is generated deep in the Parser, it's unlikely one will be added. And it's not "plain" format, it's limited HTML.

Your problem, fundamentally, is that you're trying to match up the untidied and semi-stripped HTML from half-parsed wikitext used to generate the table of contents entry with the tidied and differently-stripped HTML from fully-parsed wikitext that TextExtracts produces. Most of the time you can probably match that up in various ways, but there are almost certainly going to be cases that break it.
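
One of the "various ways" of matching could be sketched as follows: compare headings after collapsing whitespace. This is only a heuristic under the assumption that whitespace is the sole difference; it does nothing about differences in HTML stripping, so it will still miss some cases. The function names are hypothetical.

```python
import re


def normalize_heading(text):
    """Collapse runs of whitespace to a single space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def headings_match(toc_title, extract_title):
    """Hypothetical comparison: equal once whitespace is normalized."""
    return normalize_heading(toc_title) == normalize_heading(extract_title)


# The two titles reported in this task.
toc_title = "Conservatief-Katholiek  (14 zetels)"     # action=parse&prop=sections
extract_title = "Conservatief-Katholiek (14 zetels)"  # prop=extracts
same_section = headings_match(toc_title, extract_title)
```

With the titles from the task description, a direct string comparison fails while the normalized comparison succeeds.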

Jdlrobson added a subscriber: Jdlrobson.

@Kroeliebuschie TextExtracts actually has quite a few issues and we're not planning to fix them. It would be worth following T113094 (a new API that will be deployed within the next few weeks). See also T170617.

@Anomie @Jdlrobson thank you for explaining the problem. It was causing a bug in a Python module. I made a pull request that fixes the issue (it can be seen below the issue description).

Jdlrobson closed this task as Invalid. · Mar 16 2018, 4:34 PM

The issue was in the Python module, it seems.