Parsing json API query.search results breaks URL with .searchmatch in URL
Closed, Invalid (Public)

Description

I have problems parsing a json API search result snippet. The searchmatch span breaks the link.

Add this plain URL as one-liner text to Category:John Foxx:

http://www.discogs.com/artist/John+Foxx

Now query.search for "John Foxx" category titles and parse the results (including the snippet).

The returned HTML is destroyed because span.searchmatch wraps "John" inside the URL path (I guess):

Output:

<a rel="nofollow" class="external free" href="http://www.discogs.com/artist/">http://www.discogs.com/artist/</a><span class="searchmatch">John</span>+Foxx
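A minimal sketch of why the link breaks, assuming the highlighter treats the page source as plain text and simply wraps every matched term. The `highlight` function is hypothetical and only illustrates the effect; it is not MediaWiki's actual highlighter.

```javascript
// Hypothetical plain-text highlighter: wraps each occurrence of `term`
// in a searchmatch span, with no awareness of URLs or wikitext.
function highlight(text, term) {
  // Escape regex metacharacters in the search term.
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return text.replace(new RegExp(escaped, 'g'), '<span class="searchmatch">$&</span>');
}

const source = 'http://www.discogs.com/artist/John+Foxx';
const snippet = highlight(source, 'John');

// The span now sits in the middle of the URL, so a wikitext parser
// ends the free link at the "<" character.
console.log(snippet);
```

Feeding such a string to action=parse would produce a link that stops at `/artist/`, which matches the output shown above.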

API call:

api.php?action=query&list=search&format=json&srsearch=John%20Foxx&srnamespace=14&srprop=snippet&srredirects=&srlimit=50
>>> function (data) {
        // ... build myWrappedResultItems

        $.ajax({
            type: 'POST',
            url: '...api.php',
            dataType: 'json',
            data: { format: 'json', action: 'parse', text: myWrappedResultItems }
        }).done(function (res) {
            // ... res.parse.text['*']
        });
    }

MW 1.22.12

Event Timeline

Subfader raised the priority of this task from to Needs Triage.
Subfader updated the task description. (Show Details)
Subfader added a project: MediaWiki-Action-API.
Subfader changed Security from none to None.
Subfader subscribed.
Anomie subscribed.

You haven't given enough information to actually see your problem in action, but it seems clear that this has nothing to do with the API itself. More likely you're having a problem with the data being returned by the underlying search engine or something along those lines.

Aklapper changed the task status from Open to Stalled. Dec 8 2014, 11:15 AM

Returned snippet in nowiki tag:

[http://www.facebook.com/<span class='searchmatch'>visionquest</span>.official?sk=wall Facebook]

Quick and dirty workaround (non-greedy, so that several matches in one snippet are unwrapped separately):

var snippet = item.snippet.replace(/<span class='searchmatch'>(.+?)<\/span>/g, "$1");
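A slightly more defensive variant of the workaround above, as a sketch only. It assumes the highlight markup is always a `span` whose only attribute is `class`, quoted with either single or double quotes (both forms appear in this thread). The name `stripSearchMatch` is made up for illustration.

```javascript
// Removes searchmatch highlighting from a search snippet, keeping the matched text.
// Accepts single or double quotes around the class name, and matches
// non-greedily so that each highlighted term is unwrapped on its own.
function stripSearchMatch(snippet) {
  return snippet.replace(/<span class=(['"])searchmatch\1>([\s\S]*?)<\/span>/g, '$2');
}

const raw = "[http://www.facebook.com/<span class='searchmatch'>visionquest</span>.official?sk=wall Facebook]";
const cleaned = stripSearchMatch(raw);
console.log(cleaned);
```

Any other markup or HTML entities in the snippet are left untouched, so the result may still not be valid wikitext.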

@Subfader, I work on search and maybe I can help. Can you give me more context? Which wiki has this problem? If it is your own which search backend is it using?

Search on Wikimedia-hosted wikis treats the article as text rather than HTML, even going so far as to squash the rendered HTML directly into text. We escape any HTML that comes back from the search backend other than the searchmatch span, so it's not possible to use the search page as a delivery mechanism for attacks.

I couldn't reproduce the behaviour on mediawiki.org, since the snippet there does not return the complete source text as it does on my MW 1.22.12. But the titlesnippet returns the same kind of broken markup:

https://www.mediawiki.org/wiki/Special:ApiSandbox#action=query&list=search&format=json&srsearch=APIsnippetTest&srnamespace=2&srwhat=text&srprop=snippet%7Ctitlesnippet

{
    "warnings": {
        "query": {
            "*": "Formatting of continuation data will be changing soon. To continue using the current formatting, use the 'rawcontinue' parameter. To begin using the new format, pass an empty string for 'continue' in the initial query."
        }
    },
    "query": {
        "searchinfo": {
            "totalhits": 1
        },
        "search": [
            {
                "ns": 2,
                "title": "User:Subfader/APIsnippetTest",
                "snippet": "Facebook Foobar",
                "titlesnippet": "User:Subfader/<span class=\"searchmatch\">APIsnippetTest</span>"
            }
        ]
    }
}

Then parse the output of titlesnippet...

In snippet that would be

[http://www.facebook.com/<span class=\"searchmatch\">APIsnippetTest</span>.official?sk=wall Facebook]

Snippets aren't supposed to be parsed though. They are meant to be displayed because they show the matching part. Even when searching wikitext the text snippet itself doesn't even make an effort to not cut off in the middle of syntax. For example:
https://en.wikipedia.org/w/index.php?title=Special%3ASearch&profile=default&search=insource%3Apredecessor2&fulltext=Search

It's just not _for_ that, so it's unlikely that this will work consistently the way you want it to.
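A rough sketch of displaying results without parsing the snippet, assuming the backend escapes everything except the searchmatch span, as described above for Wikimedia-hosted wikis. The helpers `escapeHtml` and `renderResults` and the `articlePath` prefix are hypothetical, and the URL encoding is simplified.

```javascript
// Escapes text for safe use inside HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Builds an HTML list from query.search results. The snippet is inserted
// as-is (assumed to be escaped text plus searchmatch spans) and is never
// sent through action=parse. `articlePath` is an assumed prefix like '/wiki/'.
function renderResults(results, articlePath) {
  const items = results.map(function (item) {
    const href = articlePath + encodeURIComponent(item.title.replace(/ /g, '_'));
    return '<li><a href="' + escapeHtml(href) + '">' + escapeHtml(item.title) + '</a>' +
      '<div class="snippet">' + item.snippet + '</div></li>';
  });
  return '<ul>' + items.join('') + '</ul>';
}

const html = renderResults([
  { title: 'Category:John Foxx', snippet: '<span class="searchmatch">John</span>+Foxx' }
], '/wiki/');
console.log(html);
```

Only the title is turned into a link here; the snippet stays display-only, which is what it is meant for.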

You're right, of course. I had just looped through the results, formatting each one and joining them all in a variable, which I then parsed in one go.

The ticket can be closed I guess.