
es.wikipedia.org TextExtracts is not stripping citation markup for plaintext
Open, Needs Triage · Public

Description

The es.wikipedia.org MediaWiki API does not seem to strip citation markup properly when article text is requested in plaintext.

For example, in https://es.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=extracts|revisions|info&titles=CNN&rvprop=timestamp&exlimit=20&exintro=true&explaintext=true&exsectionformat=plain&redirects&exsentences=2&inprop=url, the output still has markup like [1]. (This also seems to throw off the sentence counting in exsentences.)

{
    "batchcomplete": "",
    "query": {
        "pages": {
            "42587": {
                "pageid": 42587,
                "ns": 0,
                "title": "CNN",
                "extract": "Cable News Network (m\u00e1s conocido por sus siglas CNN) es un canal de televisi\u00f3n por suscripci\u00f3n estadounidense fundado en 1980 por el empresario Ted Turner. Actualmente es parte de Time Warner, y es operada por Turner Broadcasting System, una subsidiaria de Time Warner.[1]\u200b CNN fue la primera cadena de televisi\u00f3n en cubrir noticias las 24 horas del d\u00eda.[2]\u200b y el primer canal de noticias de Estados Unidos.[3]\u200b\nDesde su lanzamiento el 1 de junio de 1980,[4]\u200b la cadena se ha expandido notablemente, incluyendo en la actualidad 15 cadenas de televisi\u00f3n de cable y sat\u00e9lite, doce sitios web y dos cadenas de radio.",
                "revisions": [
                    {
                        "timestamp": "2018-05-23T22:30:11Z"
                    }
                ],
                "contentmodel": "wikitext",
                "pagelanguage": "es",
                "pagelanguagehtmlcode": "es",
                "pagelanguagedir": "ltr",
                "touched": "2018-06-07T17:42:33Z",
                "lastrevid": 108075576,
                "length": 59369,
                "fullurl": "https://es.wikipedia.org/wiki/CNN",
                "editurl": "https://es.wikipedia.org/w/index.php?title=CNN&action=edit",
                "canonicalurl": "https://es.wikipedia.org/wiki/CNN"
            }
        }
    }
}

This is unlike the behavior in most other languages, as en.wikipedia.org, fr.wikipedia.org, and zh.wikipedia.org (among others) _do_ properly strip this out.

Event Timeline

Restricted Application added subscribers: Cosine02, Aklapper. · Jun 14 2018, 8:37 PM
Legoktm renamed this task from es.wikipedia.org REST API is not stripping citation markup for plaintext API requests to es.wikipedia.org TextExtracts is not stripping citation markup for plaintext.
Jdlrobson updated the task description.
Jdlrobson added a subscriber: Jdlrobson.

There are quite a few issues with TextExtracts and not many plans to fix them: https://www.mediawiki.org/wiki/Extension:TextExtracts#Caveats

Are you able to use the REST summary endpoint for your needs?

Oh, interesting. Can we add a note to the TextExtracts documentation that the REST API is preferred? I saw that disclaimer section, but it didn't say anything about this plaintext extraction bug or about the REST API. Our current workaround is to strip out any text between square brackets.
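A minimal sketch of that kind of workaround, for anyone hitting the same issue. It is deliberately narrower than "any text between square brackets": it only removes numeric markers such as [1], plus the zero-width space (\u200b) that follows each marker in the es.wikipedia output above. The names CITATION_RE and strip_citations are made up for this example and are not part of any MediaWiki API.

```python
import re

# Matches numeric citation markers such as "[1]" or "[23]", plus an optional
# trailing zero-width space (U+200B) as seen in the es.wikipedia extract.
CITATION_RE = re.compile(r"\[\d+\]\u200b?")


def strip_citations(extract: str) -> str:
    """Remove numeric citation markers from a plaintext extract.

    Assumption: markers are always "[<digits>]". Bracketed text that is
    part of the article itself (for example "[sic]") is left untouched.
    """
    return CITATION_RE.sub("", extract)


sample = "una subsidiaria de Time Warner.[1]\u200b CNN fue la primera cadena"
print(strip_citations(sample))
```

Stripping only numeric markers trades completeness for safety; named reference groups (for example "[nota 1]") would need a broader pattern.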

Also, is this REST /page/summary API capable of taking in a batched call with multiple page titles at once, similar to how we can do so for https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=extracts|revisions|info&titles=CNN|WaPo&rvprop=timestamp&exlimit=20&exintro=true&explaintext=true&exsectionformat=plain&redirects&exsentences=2&inprop=url ? It seems like this is only able to do one page per API call, and I'd like to avoid having to repeatedly hit the API with separate curl requests if I can.
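For reference, the batched Action API call above can be assembled programmatically roughly as follows. This is only a sketch that mirrors the query parameters from the URL in this task (with format=json instead of jsonfm); the function name build_extracts_url and its arguments are hypothetical.

```python
from urllib.parse import urlencode


def build_extracts_url(titles, lang="en", sentences=2):
    """Build one Action API URL that requests extracts for several titles.

    Multiple titles are joined with "|", which is how the query in this
    task batches "CNN|WaPo" into a single request.
    """
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts|revisions|info",
        "titles": "|".join(titles),
        "rvprop": "timestamp",
        "exlimit": 20,
        "exintro": "true",
        "explaintext": "true",
        "exsectionformat": "plain",
        "redirects": "",
        "exsentences": sentences,
        "inprop": "url",
    }
    # urlencode percent-encodes "|" as "%7C".
    return f"https://{lang}.wikipedia.org/w/api.php?" + urlencode(params)


url = build_extracts_url(["CNN", "WaPo"])
print(url)
```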

It also seems like the equivalent of getting 'canonicalurl' in the TextExtracts API call I'm currently using is the 'content_urls' field present in the response of this REST API call. However, the field seems undocumented, so is it safe to use?

I guess my question is whether staying with the MediaWiki TextExtracts API is okay, given that we already have workarounds in place and that switching to the new API means losing the ability to batch calls.

Unless there are plans to deprecate TextExtracts, I imagine we can probably keep using it while keeping this bug in mind.

Yes, unfortunately batch calls are lost when switching to the REST API, so you'd need to make additional API requests.
The TextExtracts API is not officially deprecated, but in practice we're not actively working on it. I'd suggest thinking of it as being in maintenance mode.

Some warnings are included in the API responses themselves (per T170617). The warning says: "HTML may be malformed and/or unbalanced and may omit inline images. Use at your own risk. Known problems are listed at https://www.mediawiki.org/wiki/Extension:TextExtracts#Caveats."
https://en.wikipedia.beta.wmflabs.org/wiki/Special:ApiSandbox#action=query&format=json&prop=extracts&titles=Dog

Thanks for this bug report, and sorry I couldn't be of more help. I've documented this issue on the TextExtracts extension page and updated the page to point to the Summary endpoint.

Are there any plans to add batched-call support to the REST API in future versions? That's really the only blocker to switching for me, since we're likely going to need batching more and more as we scale up our product.

Not that I know of. I believe the REST requests are quite cheap, so there's no problem with hitting it 10 times for 10 pieces of data and stitching the result together.
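A rough sketch of what "hitting it N times and stitching the result together" could look like on the client side. It assumes the summary endpoint lives at /api/rest_v1/page/summary/{title} on each wiki; the helper names (summary_url, fetch_json, fetch_summaries), the User-Agent string, and the worker count are all made up for illustration. The parsed JSON is returned as-is, without assuming anything about its fields.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import quote
from urllib.request import Request, urlopen


def summary_url(title, lang="es"):
    """URL of the REST summary endpoint for one page title."""
    return (
        f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/"
        f"{quote(title, safe='')}"
    )


def fetch_json(url):
    """Fetch one URL and parse the body as JSON (placeholder User-Agent)."""
    req = Request(url, headers={"User-Agent": "summary-batch-sketch/0.1 (example)"})
    with urlopen(req, timeout=10) as resp:
        return json.load(resp)


def fetch_summaries(titles, lang="es", fetch=fetch_json, max_workers=4):
    """Issue one request per title and stitch the results into a dict.

    `fetch` is injectable so the stitching logic can be exercised
    without network access.
    """
    urls = [summary_url(t, lang) for t in titles]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with titles.
        results = list(pool.map(fetch, urls))
    return dict(zip(titles, results))
```

Calling fetch_summaries(["CNN", "Ted Turner"]) would then return a dict keyed by title. A small worker pool keeps the request rate modest; error handling and retries are left out of this sketch.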

It might be worth creating a task in Services asking whether they do, or digging around the documentation at https://www.mediawiki.org/wiki/RESTBase.

Vvjjkkii renamed this task from es.wikipedia.org TextExtracts is not stripping citation markup for plaintext to pzaaaaaaaa. · Jul 1 2018, 1:03 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
Mainframe98 renamed this task from pzaaaaaaaa to es.wikipedia.org TextExtracts is not stripping citation markup for plaintext. · Jul 1 2018, 7:05 AM
Mainframe98 raised the priority of this task from High to Needs Triage.
Mainframe98 updated the task description.
Mainframe98 added a subscriber: Aklapper.