[Bug] Beta cluster page summary endpoint sometimes responds with 5xx
Closed, Resolved (Public)

Description

Certain page preview requests from the "Dog" page on the beta cluster consistently respond with HTTP 5xx:

https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Origin_of_the_domestic_dog
{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/internal_http_error",
  "method": "post",
  "detail": "Error: connect EHOSTUNREACH 10.68.19.128:80",
  "uri": "http://deployment-mediawiki04.deployment-prep.eqiad.wmflabs/w/api.php"
}

And another page preview request from the same page:

https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Precambrian
{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/internal_http_error",
  "method": "post",
  "detail": "504: internal_http_error",
  "uri": "http://deployment-mediawiki-07.deployment-prep.eqiad.wmflabs/w/api.php"
}
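
To compare the two failures side by side, the error bodies can be reduced to the upstream host and the failure detail. This is only a rough sketch: `summarize_error` is a hypothetical helper, not part of RESTBase or HyperSwitch, and it assumes the error body always has the `uri` and `detail` fields shown above.

```python
import json
from urllib.parse import urlsplit

# The two error bodies quoted above, copied verbatim.
DOG_ERROR = """{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/internal_http_error",
  "method": "post",
  "detail": "Error: connect EHOSTUNREACH 10.68.19.128:80",
  "uri": "http://deployment-mediawiki04.deployment-prep.eqiad.wmflabs/w/api.php"
}"""

PRECAMBRIAN_ERROR = """{
  "type": "https://mediawiki.org/wiki/HyperSwitch/errors/internal_http_error",
  "method": "post",
  "detail": "504: internal_http_error",
  "uri": "http://deployment-mediawiki-07.deployment-prep.eqiad.wmflabs/w/api.php"
}"""


def summarize_error(body: str) -> dict:
    """Hypothetical helper: extract the upstream host and failure detail."""
    err = json.loads(body)
    detail = err.get("detail") or ""
    return {
        "upstream_host": urlsplit(err.get("uri", "")).hostname,
        "detail": detail,
        # True when the connection to the upstream host could not be opened.
        "unreachable": "EHOSTUNREACH" in detail,
    }


for body in (DOG_ERROR, PRECAMBRIAN_ERROR):
    print(summarize_error(body))
```

Note that the two requests were routed to different MediaWiki hosts and failed in different ways: one could not connect at all, the other timed out with a 504.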

When I make this request locally (through the Mobile Content Service, not RESTBase), it works fine:

http://localhost:6927/en.wikipedia.beta.wmflabs.org/v1/page/summary/Precambrian
{
  "type": "standard",
  "title": "Precambrian",
  "displaytitle": "Precambrian",
  "namespace": {
    "id": 0,
    "text": ""
  },
  "titles": {
    "canonical": "Precambrian",
    "normalized": "Precambrian",
    "display": "Precambrian"
  },
  "pageid": 93198,
  "lang": "en",
  "dir": "ltr",
  "revision": "214715",
  "tid": "528e8ba5-1295-11e8-bee0-b33e638c54d7",
  "timestamp": "2015-03-31T22:15:13Z",
  "content_urls": {
    "desktop": {
      "page": "https://en.wikipedia.beta.wmflabs.org/wiki/Precambrian",
      "revisions": "https://en.wikipedia.beta.wmflabs.org/wiki/Precambrian?action=history",
      "edit": "https://en.wikipedia.beta.wmflabs.org/wiki/Precambrian?action=edit",
      "talk": "https://en.wikipedia.beta.wmflabs.org/wiki/Talk:Precambrian"
    },
    "mobile": {
      "page": "https://en.m.wikipedia.beta.wmflabs.org/wiki/Precambrian",
      "revisions": "https://en.m.wikipedia.beta.wmflabs.org/wiki/Special:History/Precambrian",
      "edit": "https://en.m.wikipedia.beta.wmflabs.org/wiki/Precambrian?action=edit",
      "talk": "https://en.m.wikipedia.beta.wmflabs.org/wiki/Talk:Precambrian"
    }
  },
  "api_urls": {
    "summary": "https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Precambrian",
    "edit_html": "https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Precambrian",
    "talk_page_html": "https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/html/Talk:Precambrian"
  },
  "extract": "(Blank.)",
  "extract_html": "<p>(Blank.)</p>"
}

I believe that the underlying MediaWiki API request is:

https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&lllimit=max&pilicense=any&piprop=thumbnail%7Coriginal%7Cname&pithumbsize=320&wbptterms=description&inprop=protection&rvprop=ids%7Ctimestamp%7Cuser%7Ccontentmodel&titles=Precambrian&prop=coordinates%7Cpageprops%7Cpageimages%7Cpageterms%7Crevisions%7Cinfo%7Clanglinks%7Ccategories&clprop=hidden&cllimit=50&format=json&formatversion=2&continue=
{
  "batchcomplete": true,
  "query": {
    "pages": [
      {
        "pageid": 93198,
        "ns": 0,
        "title": "Precambrian",
        "revisions": [
          {
            "revid": 214715,
            "parentid": 0,
            "user": "Jdforrester (WMF)",
            "timestamp": "2015-03-31T22:15:13Z",
            "contentmodel": "wikitext"
          }
        ],
        "contentmodel": "wikitext",
        "pagelanguage": "en",
        "pagelanguagehtmlcode": "en",
        "pagelanguagedir": "ltr",
        "touched": "2015-03-31T22:15:14Z",
        "lastrevid": 214715,
        "length": 8,
        "new": true,
        "protection": [],
        "restrictiontypes": [
          "edit",
          "move"
        ]
      }
    ]
  },
  "limits": {
    "langlinks": 500
  }
}

Which also works fine from my machine.
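
For readability, the query string above can be unpacked into its individual parameters. A minimal sketch using only the Python standard library; `unpack_query` is a hypothetical helper name, and the URL is the one quoted above.

```python
from urllib.parse import parse_qsl, urlsplit

# The MediaWiki API request quoted above, split across lines for readability.
API_URL = (
    "https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&lllimit=max"
    "&pilicense=any&piprop=thumbnail%7Coriginal%7Cname&pithumbsize=320"
    "&wbptterms=description&inprop=protection"
    "&rvprop=ids%7Ctimestamp%7Cuser%7Ccontentmodel&titles=Precambrian"
    "&prop=coordinates%7Cpageprops%7Cpageimages%7Cpageterms%7Crevisions"
    "%7Cinfo%7Clanglinks%7Ccategories&clprop=hidden&cllimit=50"
    "&format=json&formatversion=2&continue="
)


def unpack_query(url: str) -> dict:
    """Hypothetical helper: decode the query string, splitting '|' lists."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return {k: v.split("|") if "|" in v else v for k, v in pairs}


params = unpack_query(API_URL)
for name, value in params.items():
    print(f"{name:>14} = {value}")
```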

Restricted Application added a subscriber: Aklapper.
ovasileva triaged this task as High priority. Apr 16 2018, 2:44 PM

@Niedzielski I restarted the beta cluster restbase and mobileapps services while testing out some theories, and curiously enough that seems to have resolved the immediate issue. Could you try to reproduce this again now? (Note that "Origin of the domestic dog" doesn't exist on beta and will correctly return 404.)

Mentioned in SAL (#wikimedia-releng) [2018-04-16T20:35:31Z] <mdholloway> restarted restbase and mobileapps services for testing (T192287)

Mholloway added a comment. Edited Apr 16 2018, 8:52 PM

For posterity: there was a recent change to the config variable that sets the MW API URL for the node.js services on the beta cluster (https://gerrit.wikimedia.org/r/#/c/425822/). My idea was to try setting mobileapps back to the old URL for testing. I did that and restarted, and it didn't help, but after resetting to the current value and restarting again, it suddenly worked. This fixed the issue for the "Precambrian" page. Restarting restbase fixed the issue with "Origin of the domestic dog." The services are ordinarily restarted in the course of deployments, and neither had seen a code deployment since before the config change.

I don't have a particularly clear idea of what was going wrong, but I guess there was some bad state around the MW API config var change that restarting cleared up.

I tried about 50 links and it seems to work. Thanks (and thanks for keeping a debugging record) @Mholloway!!

Pchelolo closed this task as Resolved. Apr 17 2018, 12:52 PM
Pchelolo edited projects, added Services (done); removed Services.
Pchelolo added a subscriber: Pchelolo.

The deployment-mediawiki04.deployment-prep.eqiad.wmflabs host was removed per T192071 - that explains the issue. I think this can be resolved now, please reopen if it comes back.
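
If this comes back, one quick way to confirm that a service is still pointing at a removed host is to try opening a TCP connection to the host named in the error's `uri` field. The following is a rough sketch, not something the services themselves do: `can_connect` is a hypothetical helper, and the port and timeout are assumptions. The `connect` parameter exists only so the function can be exercised without network access.

```python
import socket


def can_connect(host, port=80, timeout=3.0, connect=socket.create_connection):
    """Hypothetical helper: return (True, None) if a TCP connection opens,
    otherwise (False, reason). DNS failures and timeouts are OSError too."""
    try:
        conn = connect((host, port), timeout)
    except OSError as exc:
        return False, str(exc)
    conn.close()
    return True, None
```

From a machine inside the deployment-prep project, `can_connect("deployment-mediawiki04.deployment-prep.eqiad.wmflabs")` would be expected to return a failure now that the host has been removed.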