
Missing Extract info for Manhattan articles on Wikivoyage
Open, Needs Triage | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

When I run a query to load articles within 20 km of particular coordinates on Wikivoyage, sometimes I do not get any "extract" information for some of those locations. An example is New York: many of the Manhattan articles do not show extract info, despite there being a good amount of useful text at the start of each article. This is my test request:

https://en.wikivoyage.org/w/api.php?action=query&format=json&generator=geosearch&ggsprimary=all&prop=coordinates|pageimages|extracts|pageprops&coprimary=all&piprop=thumbnail&exintro=1&explaintext=1&origin=*&ggscoord=40.71277537694216%7C-74.00597289204597&ggsradius=20000&ggslimit=100&colimit=100&pithumbsize=400

And the Manhattan/Soho block looks like:

"21190": {
  "pageid": 21190,
  "ns": 0,
  "title": "Manhattan/SoHo",
  "index": 29,
  "coordinates": [
    {
      "lat": 40.723056,
      "lon": -74.000833,
      "primary": "",
      "globe": "earth"
    }
  ],
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/NYC_The_Wall_-_Gateway_to_Soho.JPG/500px-NYC_The_Wall_-_Gateway_to_Soho.JPG",
    "width": 400,
    "height": 300
  },
  "pageprops": {
    "geocrumb-is-in": "21171",
    "kartographer_frames": "1",
    "kartographer_links": "61",
    "page_image_free": "NYC_The_Wall_-_Gateway_to_Soho.JPG",
    "wikibase-badge-Q17559452": "",
    "wikibase_item": "Q461572"
  }
}

If I do a similar search for Rome:

https://en.wikivoyage.org/w/api.php?action=query&format=json&generator=geosearch&ggscoord=41.8967068%7C12.4822025&ggsradius=20000&colimit=50&ggslimit=50&ggsprimary=all&prop=coordinates%7Cpageimages%7Cextracts%7Cpageprops&coprimary=all&piprop=thumbnail&pithumbsize=400&exintro=1&explaintext=1&origin=*

The Rome/South block looks like:

"30022": {
  "pageid": 30022,
  "ns": 0,
  "title": "Rome/South",
  "index": 11,
  "coordinates": [
    {
      "lat": 41.8606,
      "lon": 12.5139,
      "primary": "",
      "globe": "earth"
    }
  ],
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Square_colosseum.jpg/500px-Square_colosseum.jpg",
    "width": 400,
    "height": 454
  },
  "extract": "The South of Rome includes the historic Appian Way and nearby catacombs, as well as important tourist attractions in EUR, and San Paolo.",
  "pageprops": {
    "geocrumb-is-in": "29990",
    "kartographer_frames": "1",
    "kartographer_links": "36",
    "page_image_free": "Square_colosseum.jpg",
    "wikibase-badge-Q17559452": "",
    "wikibase_item": "Q14231300"
  }
}

It has the extract block, which I feel the SoHo article should have too. I've had a look at the source text behind each page and cannot see any discernible difference in structure that would cause the extract to be omitted from the former.

What happens?:

No extract text is returned for SoHo (or TriBeCa and many others in the Manhattan area)

What should have happened instead?:

Extract info should be returned

Event Timeline

A_smart_kitten subscribed.

As far as I can see, your first-linked API request includes extracts for the first 20 results. The second-linked API request returns fewer than 20 results in total, and includes an extract for all of them.

Looking at the TextExtracts code, it appears that it will only include extracts for a maximum of 20 pages in one API request. This feels like a similar case to T410886#11412565; except, in this instance, at first glance, I don't think there's a way to get a text extract for more than 20 pages in the same API request. Hopefully the docs at https://www.mediawiki.org/wiki/API:Continue might be helpful for you in this case :-)
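The continuation mechanism described in those docs can be wrapped in a loop that re-issues the query with the returned `continue` tokens and merges each batch's page data, so extracts for pages beyond the first 20 accumulate across requests. A minimal sketch (the `fetch_all` name and the injectable `fetch` parameter are mine, added so the loop can be exercised without hitting the live API):

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikivoyage.org/w/api.php"

def fetch_all(params, fetch=None):
    """Repeat an action=query request, merging page data across
    continuation batches, until the API stops returning a 'continue'
    object. `fetch` maps a params dict to a decoded JSON response;
    by default it performs a real HTTP request."""
    if fetch is None:
        def fetch(p):
            url = API + "?" + urllib.parse.urlencode(p)
            with urllib.request.urlopen(url) as resp:
                return json.load(resp)
    params = dict(params, format="json")
    pages = {}
    cont = {}
    while True:
        data = fetch({**params, **cont})
        # Later batches repeat the same pageids but fill in properties
        # (like 'extract') that earlier batches omitted, so merge.
        for pageid, page in data.get("query", {}).get("pages", {}).items():
            pages.setdefault(pageid, {}).update(page)
        cont = data.get("continue")
        if not cont:
            return pages
```

Merging with `update` matters here: each continuation batch may list the same pages again with only the newly available properties, so the results have to be combined rather than overwritten.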

Thanks for the reply. It does seem consistent, and I noticed that for London (another big city, with 33 articles) a few were missing too. It's a little unfortunate that it doesn't continue to load the extract for the rest of them, as it is already loading plenty of other useful data for me. But I can only assume that there must be a good reason to stop at 20.

I'll take a look at the continue option, but I'll give it a couple of days in the hope that someone else can jump in and maybe provide a solution. My other option is to have my code load the missing extract data as and when someone actually needs it.
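That on-demand fallback would just be a small extracts-only query keyed by `pageids` instead of the geosearch generator. A hypothetical helper for building such a request (the function name is mine; the parameters are the standard TextExtracts ones already used in the URLs above):

```python
def extract_params(pageids):
    """Build query parameters to fetch intro extracts for specific
    pages by ID. TextExtracts serves at most 20 extracts per request,
    so callers should pass 20 pageids or fewer."""
    if len(pageids) > 20:
        raise ValueError("TextExtracts returns at most 20 extracts per request")
    return {
        "action": "query",
        "format": "json",
        "pageids": "|".join(str(p) for p in pageids),
        "prop": "extracts",
        "exintro": 1,
        "explaintext": 1,
        "origin": "*",
    }
```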

Thanks for your reply.

But I can only assume that there must be a good reason to stop at 20.

(For what it's worth, the limit of 20 appears to have originally been set in https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/MobileFrontend/+/7f0702873ee6cc1a14489a65fe02975c14d69d08%5E%21/ / https://static-codereview.wikimedia.org/MediaWiki/113904.html in 2012, with reference to performance concerns)

Thanks for finding the source of that. Probably a reasonable concern back in 2012 - maybe less so now. Maybe someone will see this and think it could be updated (or reverted! ;) ). As I said earlier, I'll give it a couple of days, then implement some logic using continue if it looks like it will stay as is.
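If the limit does stay as is, the continue-based logic mentioned above amounts to splitting the pageids that came back without an extract into batches of at most 20 and issuing one follow-up extracts-only request per batch. A hypothetical chunking helper:

```python
def chunks(pageids, size=20):
    """Split a list of pageids into batches no larger than the
    TextExtracts per-request cap of 20."""
    return [pageids[i:i + size] for i in range(0, len(pageids), size)]
```

For a city like London with 33 articles, this yields two batches (20 and 13), i.e. one extra request to backfill the missing extracts.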