Page MenuHomePhabricator

Errors in production after 11/13 mobileapps deployment
Closed, ResolvedPublic

Description

Lots of errors occurred after the 11/13 mobileapps deployment. They appear mostly of two varieties.

The most common type of error is from requests for pages outside the main namespace, usually File pages but also some Category pages, hitting the mobile-sections endpoint for commons.wikimedia.org:

{"name":"mobileapps","hostname":"scb1002","pid":1209,"level":50,"message":"504: internal_http_error","status":504,"type":"internal_http_error","detail":"504: internal_http_error","levelPath":"error/504","request_id":"b7239f98-c8db-11e7-ad03-fc45416fb56e","request":{"url":"/commons.wikimedia.org/v1/page/mobile-sections/File%3AHickman_Natural_Bridge%2C_Capitol_Reef%2C_Utah%2C_June_2011_-_panoramio.jpg","headers":{"x-request-id":"b7239f98-c8db-11e7-ad03-fc45416fb56e","content-length":"0"},"method":"GET","params":{"0":"/commons.wikimedia.org/v1/page/mobile-sections/File:Hickman_Natural_Bridge,_Capitol_Reef,_Utah,_June_2011_-_panoramio.jpg"},"query":{},"remoteAddress":"10.64.48.10","remotePort":51494},"msg":"504: internal_http_error","time":"2017-11-14T01:32:46.432Z","v":0}

Oddly, these requests seemed to complete successfully using curl locally from SCB.

And another, less frequent error occurred for the featured-image route:

{"name":"mobileapps","hostname":"scb1002","pid":1227,"level":50,"message":"500: internal_error","stack":"TypeError: Cannot read property 'payload' of undefined\n    at Request.featured.promise.then (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/src/routes/featured-image.js:29:33)\n    at Request.tryCatcher (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:512:31)\n    at Promise._settlePromise (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:569:18)\n    at Promise._settlePromise0 (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:614:10)\n    at Promise._settlePromises (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:693:18)\n    at Promise._fulfill (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:638:18)\n    at Promise._resolveCallback (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:432:57)\n    at Promise._settlePromiseFromHandler (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:524:17)\n    at Promise._settlePromise (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:569:18)\n    at Promise._settlePromise0 (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:614:10)\n    at Promise._settlePromises (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/promise.js:689:18)\n    at Async._drainQueue (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/async.js:133:16)\n    at Async._drainQueues (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/async.js:143:10)\n    at Immediate.Async.drainQueues (/srv/deployment/mobileapps/deploy-cache/revs/9b10959551a936e5c69cc83104a08c2cbfad45e9/node_modules/bluebird/js/release/async.js:17:14)\n    at runCallback (timers.js:672:20)","status":500,"type":"internal_error","detail":"Cannot read property 'payload' of undefined","levelPath":"error/500","request_id":"64a44c53-c8db-11e7-bc1f-72c8cd949c96","request":{"url":"/en.wikipedia.org/v1/media/image/featured/2016/04/29","headers":{"x-request-id":"64a44c53-c8db-11e7-bc1f-72c8cd949c96"},"method":"GET","params":{"0":"/en.wikipedia.org/v1/media/image/featured/2016/04/29"},"query":{},"remoteAddress":"10.64.16.21","remotePort":41000},"msg":"500: internal_error","time":"2017-11-14T01:30:27.048Z","v":0}

Both seem to have abated after rolling back the deployment.

I left the 11/13 deployment running on the Beta Cluster for testing purposes.

Event Timeline

Change 391157 had a related patch set uploaded (by BearND; owner: BearND):
[mediawiki/services/mobileapps@master] featured-image: better diagnostics in error case

https://gerrit.wikimedia.org/r/391157

The first one is strange. I can't repro the first that one on labs or locally. Not sure it's possible to repro on the beta cluster. The only place in our codebase I find internal_http_error is the preq library.

I've got an idea about the second one (see patch above). That one has at least a stack trace. This can happen when an Error is thrown inside the featured-image.promise so it goes into the catch block but status is not 504. I just replace the return statement at the end of one of the two then blocks with throw new Error(). That will result in the response object in the route at line 28 + 29 to be undefined.

For the first error, looking at the logs, it looks like most if not all of the errors were occurring on host restbase-dev1006. I'm not sure what that implies.

Here's a typical example:

{
  "_index": "logstash-2017.11.13",
  "_type": "restbase-dev",
  "_id": "AV-3qV_7-DyidI0Flayt",
  "_score": null,
  "_source": {
    "error_headers_access-control-expose-headers": "etag",
    "root_req_headers_content-length": "0",
    "root_req_method": "get",
    "root_req_headers_cache-control": "no-cache",
    "error_body_method": "post",
    "error_headers_x-content-security-policy": "default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'",
    "pid": 14,
    "type": "restbase-dev",
    "error_headers_content-type": "application/json; charset=utf-8",
    "error_headers_vary": "Accept-Encoding",
    "error_headers_x-content-type-options": "nosniff",
    "levelPath": "error/mobileapps",
    "error_body_detail": "504: internal_http_error",
    "error_headers_x-frame-options": "SAMEORIGIN",
    "error_body_status": 504,
    "root_req_headers_x-triggered-by": "req:803b5807-7c33-4d4a-a665-2472278b9c5c,resource_change:https://commons.wikimedia.org/wiki/File:Yellands,_near_Whimple_-_geograph.org.uk_-_178286.jpg,resource_change:http://commons.wikimedia.org/api/rest_v1/page/html/File%3AYellands%2C_near_Whimple_-_geograph.org.uk_-_178286.jpg",
    "error_headers_x-webkit-csp": "default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'",
    "host": "restbase-dev1006",
    "@version": "1",
    "error_headers_access-control-allow-origin": "*",
    "gelf_level": "3",
    "error_headers_content-security-policy": "default-src 'self'; object-src 'none'; media-src *; img-src *; style-src *; frame-ancestors 'self'",
    "root_req_uri": "/commons.wikimedia.org/v1/page/mobile-sections/File%3AYellands%2C_near_Whimple_-_geograph.org.uk_-_178286.jpg?redirect=false",
    "error_status": 504,
    "root_req_headers_x-request-class": "internal_update",
    "error_body_uri": "http://api-rw.discovery.wmnet/w/api.php",
    "error_message": "504: internal_http_error",
    "error_name": "HTTPError",
    "source_host": "10.64.48.10",
    "level": "ERROR",
    "error_headers_connection": "close",
    "message": "Error fetching old mobile content",
    "root_req_headers_x-client-ip": "::ffff:10.192.16.151",
    "version": "1.0",
    "root_req_headers_user-agent": "ChangePropagation-Dev/WMF",
    "normalized_message": "Error fetching old mobile content",
    "tags": [
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "@timestamp": "2017-11-13T23:14:11.427Z",
    "root_req_headers_x-forwarded-for": "::ffff:10.192.16.151",
    "error_headers_access-control-allow-headers": "accept, x-requested-with, content-type",
    "error_headers_x-xss-protection": "1; mode=block",
    "error_headers_date": "Mon, 13 Nov 2017 23:14:11 GMT",
    "error_body_type": "internal_http_error",
    "request_id": "803b5807-7c33-4d4a-a665-2472278b9c5c",
    "root_req_headers_x-request-id": "803b5807-7c33-4d4a-a665-2472278b9c5c"
  },
  "fields": {
    "@timestamp": [
      1510614851427
    ]
  },
  "sort": [
    1510614851427
  ]
}

restbase-dev1006 is a part of our dev cluster. WE are not currently using it however there's a change-prop instance running there - that's the source of errors from there. I'll stop change-prop in dev cluster

Mholloway claimed this task.

All quiet after today's redeploy. I guess I'll call it a win.

Change 391157 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] featured-image: better diagnostics in error case

https://gerrit.wikimedia.org/r/391157