API problems with double spaces in wiki sections
Closed, Invalid · Public

Assigned To

None

Authored By

	Kroeliebuschie
	Jan 22 2018, 11:27 AM

Description

When I use this api query, a list of sections of that page is returned:

https://nl.wikipedia.org/w/api.php?action=parse&prop=sections&page=Samenstelling_Tweede_Kamer_1875-1879&format=json

The following query returns the content of the page:

https://nl.wikipedia.org/w/api.php?explaintext=&rvprop=ids&titles=Samenstelling_Tweede_Kamer_1875-1879&prop=extracts|revisions&action=query&format=json

The problem is that one of the section titles returned by the first query contains two consecutive spaces:

Conservatief-Katholiek  (14 zetels)

While the same section title in the body (returned by the second query) does not have this double space:

Conservatief-Katholiek (14 zetels)

I noticed this while using the wikipedia Python package. To make my script work I removed the double space from the wiki page itself, so to reproduce the error that edit has to be reverted first.

I'm not sure whether this is a bug or a feature.
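For anyone trying to reproduce this offline, here is a minimal sketch of the comparison that fails. The two URL builders mirror the queries above; `missing_titles` and the sample strings are hypothetical stand-ins written for illustration, not live API output and not code from the wikipedia package.

```python
from urllib.parse import urlencode

API = "https://nl.wikipedia.org/w/api.php"
PAGE = "Samenstelling_Tweede_Kamer_1875-1879"


def sections_url(page=PAGE):
    """URL of the first query (action=parse, prop=sections)."""
    params = {"action": "parse", "prop": "sections", "page": page, "format": "json"}
    return API + "?" + urlencode(params)


def extract_url(page=PAGE):
    """URL of the second query (prop=extracts|revisions with explaintext)."""
    params = {
        "action": "query",
        "prop": "extracts|revisions",
        "explaintext": "",
        "rvprop": "ids",
        "titles": page,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def missing_titles(section_titles, extract_text):
    """Return the section titles that do not occur verbatim in the extract."""
    return [title for title in section_titles if title not in extract_text]


# Hand-written sample data imitating the mismatch described above.
sample_titles = ["Conservatief-Katholiek  (14 zetels)"]  # two spaces
sample_extract = "== Conservatief-Katholiek (14 zetels) ==\nBody text."  # one space

print(missing_titles(sample_titles, sample_extract))
```

A naive substring lookup like this is presumably what breaks in a client that matches the title list against the extract text.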

Related Objects

Status	Assigned	Task
Resolved	None	T169242 Develop Page Content Service for Reading Clients
Resolved	None	T177425 Develop General Layer of PCS
Resolved	• Jhernandez	T177426 Develop structured JSON APIs for general consumption
Resolved	• Mholloway	T177431 Develop a Summary JSON API
Resolved	Dereckson	T68374 Enable Hovercards on se.wikimedia.org (Swedish chapter wiki)
Resolved	Jdlrobson	T70860 [GOAL] Graduate Page Previews feature (Popups extension) out of Beta Feature
Resolved	ovasileva	T154635 [EPIC] Deploy page previews to English and German Wikipedia
Resolved	ovasileva	T192622 [EPIC] Page previews post-deploy cleanup
Resolved	Jdlrobson	T173952 Remove A/B testing instrumentation code
Duplicate	None	T167433 Switch all projects to the new (and yet to be built) summary-html endpoint for page previews
Duplicate	None	T167429 Make enwiki and dewiki fetch previews from the summary-html RESTBase endpoint
Resolved	ovasileva	T165018 Page previews can consume new summary-HTML endpoint
Declined	Jdlrobson	T111329 [GOAL] Page previews on mobileweb
Resolved	Jdlrobson	T164010 [EPIC] Strengthen the APIs we provide in reading web maintained extensions
Resolved	ovasileva	T113094 [EPIC] The Page Summary API needs to provide useful content for the majority of articles
Resolved	Jdlrobson	T170617 Adjust expectations for API consumers when using the TextExtracts API
Invalid	None	T185472 API problems with double spaces in wiki sections

Event Timeline

Kroeliebuschie created this task. Jan 22 2018, 11:27 AM

Restricted Application added a subscriber: Aklapper. Jan 22 2018, 11:27 AM

I saw the following in the documentation:

sections: Gives the sections in the parsed wikitext (TOC, table of contents).

and:

explaintext

Return extracts as plain text instead of limited HTML. 
Type: boolean (details)

exsectionformat

How to format sections in plaintext mode:

plain
    No formatting.
wiki
    Wikitext-style formatting (== like this ==).
raw
    This module's internal representation (section titles prefixed with <ASCII 1><ASCII 2><section level><ASCII 2><ASCII 1>).

One of the following values: plain, wiki, raw
Default: wiki
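As an aside, the raw format quoted above could be parsed along these lines. This is only a sketch based on the documented prefix (ASCII 1, ASCII 2, section level, ASCII 2, ASCII 1); it assumes the heading text runs to the end of the line, which the documentation does not state, and `raw_headings` plus the sample string are hypothetical.

```python
import re

# Prefix documented for exsectionformat=raw:
# <ASCII 1><ASCII 2><section level><ASCII 2><ASCII 1>
# Assumption: the heading text follows the prefix and ends at the newline.
RAW_HEADING = re.compile(r"\x01\x02([1-6])\x02\x01([^\n]*)")


def raw_headings(extract_text):
    """Return (level, title) pairs found in a raw-format extract."""
    return [
        (int(level), title.strip())
        for level, title in RAW_HEADING.findall(extract_text)
    ]


# Hand-written sample, not real API output.
sample = "Intro text.\n\n\x01\x022\x02\x01Conservatief-Katholiek (14 zetels)\nBody text."
print(raw_headings(sample))
```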

It appears that with the explaintext parameter, prop=extracts returns section headings in wiki formatting by default. Is there a way to get the same formatting from the sections query as well? If not, this issue can be turned into a feature request.

If you want the actual wikitext rather than the text as adjusted by TextExtracts, use [[https://nl.wikipedia.org/w/api.php?rvprop=ids%7Ccontent&titles=Samenstelling_Tweede_Kamer_1875-1879&prop=revisions&action=query&format=json|rvprop=content]] instead.

If you want section headings from TextExtracts in plain text rather than having them back-converted to wikitext-like syntax, try [[https://nl.wikipedia.org/w/api.php?explaintext=&rvprop=ids&titles=Samenstelling_Tweede_Kamer_1875-1879&prop=extracts%7Crevisions&action=query&format=json&exsectionformat=plain|exsectionformat=plain]].
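A small sketch of building that second request programmatically. The parameter names come from the URLs above; the helper `extracts_query` is hypothetical and only assembles the URL, it does not send the request.

```python
from urllib.parse import urlencode

API = "https://nl.wikipedia.org/w/api.php"
SECTION_FORMATS = {"plain", "wiki", "raw"}  # values listed in the documentation


def extracts_query(title, section_format="plain"):
    """Build a prop=extracts URL with an explicit exsectionformat value."""
    if section_format not in SECTION_FORMATS:
        raise ValueError(f"unknown exsectionformat: {section_format!r}")
    params = {
        "action": "query",
        "prop": "extracts|revisions",
        "explaintext": "",
        "exsectionformat": section_format,
        "rvprop": "ids",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)


print(extracts_query("Samenstelling_Tweede_Kamer_1875-1879"))
```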

Anomie moved this task from Unsorted to Non-core-API stuff on the MediaWiki-Action-API board. Jan 22 2018, 3:46 PM

@Anomie Thank you, those parameters apply to the content queries. What I was actually wondering is whether there is a formatting option for prop=sections, something like exsectionformat=wiki but for the section query (the first one in my first message). It looks like it's using plain by default.

There isn't, and since the data is generated deep in the Parser it's unlikely one will be added. And it's not "plain" format, it's a limited HTML.

Your problem, fundamentally, is that you're trying to match up the untidied and semi-stripped HTML from half-parsed wikitext used to generate the table of contents entry with the tidied and differently-stripped HTML from fully-parsed wikitext that TextExtracts produces. Most of the time you can probably match that up in various ways, but there are almost certainly going to be cases that break it.
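One of those "various ways", sketched on the client side: normalize both titles before comparing them. The helper `normalize_title` is hypothetical, and as noted above this kind of matching is best-effort; it handles stray tags, entities and repeated whitespace, nothing more.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def normalize_title(title):
    """Strip simple HTML tags, decode entities, and collapse whitespace runs."""
    text = TAG.sub("", title)    # drop tags first so escaped "&lt;" survives
    text = html.unescape(text)   # "&amp;" -> "&", "&nbsp;" -> U+00A0
    return WHITESPACE.sub(" ", text).strip()


toc_title = "Conservatief-Katholiek  (14 zetels)"     # from prop=sections
extract_title = "Conservatief-Katholiek (14 zetels)"  # from the extract
print(normalize_title(toc_title) == normalize_title(extract_title))
```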

@Kroeliebuschie TextExtracts actually has quite a few issues and we're not planning to fix them. It would be worth following T113094 (a new API that will be deployed within the next few weeks). See also T170617.

Jdlrobson added a parent task: T170617: Adjust expectations for API consumers when using the TextExtracts API. Jan 22 2018, 11:04 PM

@Anomie @Jdlrobson Thank you for explaining the problem. It was causing a bug in a Python module. I made a pull request that fixes the issue (it can be seen below the issue description).

Jdlrobson moved this task from Untriaged to Discuss further on the Web-Team-Backlog (Tracking) board. Jan 30 2018, 5:51 PM

It seems the issue was in the Python module.

Aklapper removed a subscriber: Anomie. Oct 16 2020, 5:28 PM
