
[Spike 2 hours] Can we get lead sections returned in search results?
Closed, Resolved (Public)

Event Timeline

Fjalapeno raised the priority of this task to Needs Triage.
Fjalapeno updated the task description.
Fjalapeno subscribed.

Assigning to Brian - he already started on it last night.

Answer is yes:

You can use search as a generator for extracts (and any other query "props"), which gives you the first N characters of each result's "intro", where N is the number given in the exchars query param. Here's the first result when using a search for Barack Obama as a generator for extracts, pageprops, and pageterms:
#action=query&prop=extracts|pageprops|pageterms&format=json&exchars=150&wbptterms=description&generator=search&gsrsearch=Barack%20Obama

Note the "extract" field:

"534366": {
    "pageid": 534366,
    "ns": 0,
    "title": "Barack Obama",
    "index": 1,
    "extract": "<p><b>Barack Hussein Obama II</b> (<span><small>US</small> <span>/<span><span title=\"'b' in 'buy'\">b</span><span title=\"/\u0259/ 'a' in 'about'\">\u0259</span></span></span></span></p>...",
    "pageprops": {
        "defaultsort": "Obama, Barack",
        "page_image": "President_Barack_Obama.jpg",
        "wikibase_item": "Q76"
    },
    "terms": {
        "description": [
            "44th President of the United States"
        ]
    }
}
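For anyone who wants to reproduce the query above from a script, here is a minimal sketch that assembles the same parameters with Python's standard library. The endpoint URL and the function name `build_search_extract_url` are assumptions for illustration; the thread itself only shows the query string.

```python
from urllib.parse import urlencode, quote

# Assumed endpoint: the thread only shows the query string, not the host.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_search_extract_url(term, exchars=150):
    """Build a URL that runs a search and requests props for each hit.

    Hypothetical helper; the parameters mirror the query quoted above.
    """
    params = {
        "action": "query",
        "prop": "extracts|pageprops|pageterms",
        "format": "json",
        "exchars": exchars,
        "wbptterms": "description",
        "generator": "search",   # search results feed the prop modules
        "gsrsearch": term,       # generator params take the "g" prefix
    }
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return API_ENDPOINT + "?" + urlencode(params, quote_via=quote)


print(build_search_extract_url("Barack Obama"))
```

Building the query from a dict keeps the generator parameters (the `gsr`-prefixed ones) visibly separate from the prop parameters.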

@BGerstle-WMF, it looks like it only gives back the extract for the first result.

Yeah, something unique to textextracts. @MaxSem would you mind enlightening us as to why textextracts doesn't properly handle generator input?

[10:36:03] <MaxSem> not that it doesn't support, but you can retrieve only 1 full extract per request
[10:36:16] <MaxSem> because of a bad worst-case behavior

Note that lead sections (&exintro=1) don't have this limitation.

Got it, thanks @MaxSem! The API confused me: you need to specify both exintro and exlimit to get the number of extracts you want (at most 20). This might be remedied with the API and doc adjustments mentioned in T102856 and this talk thread.

@Jhernandez @phuedx @Fjalapeno, here's the working query:
#action=query&prop=extracts|pageprops|pageterms&format=json&exchars=150&exlimit=20&exintro=&wbptterms=description&generator=search&gsrsearch=Barack%20Obama&gsrlimit=20

"query": {
    "pages": {
        "534366": {
            "pageid": 534366,
            "ns": 0,
            "title": "Barack Obama",
            "index": 1,
            "extract": "<p><b>Barack Hussein Obama II</b> (<span><small>US</small> <span>/<span><span title=\"'b' in 'buy'\">b</span><span title=\"/\u0259/ 'a' in 'about'\">\u0259</span></span></span></span></p>...",
            "pageprops": {
                "defaultsort": "Obama, Barack",
                "page_image": "President_Barack_Obama.jpg",
                "wikibase_item": "Q76"
            },
            "terms": {
                "description": [
                    "44th President of the United States"
                ]
            }
        },
        "20779076": {
            "pageid": 20779076,
            "ns": 0,
            "title": "Barack Obama: Der schwarze Kennedy",
            "index": 13,
            "extract": "<p><i><b>Barack Obama \u2013 Der schwarze Kennedy</b></i> (English: Barack Obama \u2013 The black Kennedy) is a best-selling German-language biography of President</p>...",
            "pageprops": {
                "defaultsort": "Barack Obama - Der Schwarze Kennedy",
                "displaytitle": "<i>Barack Obama: Der schwarze Kennedy</i>",
                "wikibase_item": "Q4858129"
            },
            "terms": {
                "description": [
                    "2007 German biography of Barack Obama"
                ]
            }
        },
        ...
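The "pages" object in the response above is keyed by page ID, and the two entries shown carry "index" values 1 and 13, so a client probably should not rely on key order for ranking. A minimal sketch of reading the results back in search order, assuming a decoded JSON response shaped like the one above (the helper names are hypothetical, and the sample is trimmed from the output in this thread):

```python
def pages_in_rank_order(response):
    """Return the page dicts sorted by their search rank ("index")."""
    pages = response.get("query", {}).get("pages", {})
    return sorted(pages.values(), key=lambda page: page["index"])


def titles_missing_extract(response):
    """List titles of results that came back without an "extract" field."""
    return [
        page["title"]
        for page in pages_in_rank_order(response)
        if not page.get("extract")
    ]


# Trimmed sample based on the response above; extracts are shortened.
sample = {
    "query": {
        "pages": {
            "20779076": {
                "pageid": 20779076,
                "title": "Barack Obama: Der schwarze Kennedy",
                "index": 13,
                "extract": "<p><i><b>Barack Obama ...</b></i></p>...",
            },
            "534366": {
                "pageid": 534366,
                "title": "Barack Obama",
                "index": 1,
                "extract": "<p><b>Barack Hussein Obama II</b></p>...",
            },
        }
    }
}

for page in pages_in_rank_order(sample):
    print(page["index"], page["title"])
```

`titles_missing_extract` is a quick way to spot the earlier symptom, where only the first result carried an extract because exintro and exlimit were not set.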

If we start using this, we'll definitely need to talk to rel-eng & ops.