Page MenuHomePhabricator

mobileapps 500s following reboot of restbase1007
Closed, ResolvedPublic

Description

mobileapps experienced some downtime today, with about 7 minutes between failure and recovery:

15:25 -icinga-wm:#wikimedia-operations- PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: 
          /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via 
          mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead 
          returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-remaining/{title} 
          (retrieve remaining sections of en.wp main page via mobile-s
15:32 -icinga-wm:#wikimedia-operations- RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All 
          endpoints are healthy

looking at the logs on scb1001
with tail -20 /srv/log/mobileapps/main.log | jq 'select(.status == 500)'

1{
2 "name": "mobileapps",
3 "hostname": "scb1001",
4 "pid": 2521,
5 "level": 50,
6 "message": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
7 "stack": "HTTPError: 500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error\n at /srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/src/lib/util.js:162:26\n at Layer.handle_error (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/layer.js:71:5)\n at trim_prefix (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:310:13)\n at /srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:280:7\n at Function.process_params (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:330:12)\n at next (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:271:10)\n at Layer.handle_error (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/layer.js:67:12)\n at trim_prefix (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:310:13)\n at /srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:280:7\n at Function.process_params (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:330:12)\n at next (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:271:10)\n at Layer.handle_error (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/layer.js:67:12)\n at trim_prefix (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:310:13)\n at /srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:280:7\n at param (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:349:14)\n at param (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:365:14)",
8 "status": 500,
9 "type": "https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
10 "detail": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
11 "levelPath": "error/500",
12 "request_id": "df1d56fa-0bf6-4dcd-8c6e-c54103bc6b3c",
13 "request": {
14 "url": "/www.wikidata.org/v1/page/mobile-sections/Q17690226",
15 "headers": {
16 "x-request-id": "df1d56fa-0bf6-4dcd-8c6e-c54103bc6b3c",
17 "content-length": "0"
18 },
19 "method": "GET",
20 "params": {
21 "0": "/www.wikidata.org/v1/page/mobile-sections/Q17690226"
22 },
23 "query": {},
24 "remoteAddress": "10.64.48.133",
25 "remotePort": 60971
26 },
27 "msg": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
28 "time": "2016-06-21T14:33:34.545Z",
29 "v": 0
30}
31{
32 "name": "mobileapps",
33 "hostname": "scb1001",
34 "pid": 2503,
35 "level": 50,
36 "message": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
37 "stack": "HTTPError: 500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error\n at /srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/src/lib/util.js:162:26\n at Layer.handle_error (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/layer.js:71:5)\n at trim_prefix (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:310:13)\n at /srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:280:7\n at Function.process_params (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:330:12)\n at next (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:271:10)\n at Layer.handle_error (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/layer.js:67:12)\n at trim_prefix (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:310:13)\n at /srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:280:7\n at Function.process_params (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:330:12)\n at next (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:271:10)\n at Layer.handle_error (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/layer.js:67:12)\n at trim_prefix (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:310:13)\n at /srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:280:7\n at param (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:349:14)\n at param (/srv/deployment/mobileapps/deploy-cache/revs/2e4d414647877e67cd0bcc775717b35fb61b53f6/node_modules/express/lib/router/index.js:365:14)",
38 "status": 500,
39 "type": "https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
40 "detail": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
41 "levelPath": "error/500",
42 "request_id": "5fd74013-0696-44a2-892b-43fd113d486d",
43 "request": {
44 "url": "/www.wikidata.org/v1/page/mobile-sections/Q24510758",
45 "headers": {
46 "x-request-id": "5fd74013-0696-44a2-892b-43fd113d486d",
47 "content-length": "0"
48 },
49 "method": "GET",
50 "params": {
51 "0": "/www.wikidata.org/v1/page/mobile-sections/Q24510758"
52 },
53 "query": {},
54 "remoteAddress": "10.64.32.80",
55 "remotePort": 41856
56 },
57 "msg": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
58 "time": "2016-06-21T14:46:14.744Z",
59 "v": 0
60}

mobileapps logs are not in logstash (as per T121571: Improve logstash logging for graphoid, mobileapps, mathoid) but taking for example request-id "5fd74013-0696-44a2-892b-43fd113d486d" yields two logs from restbase

{
  "_index": "logstash-2016.06.21",
  "_type": "restbase",
  "_id": "AVVzbQ_ww3dCNxx2ZHRh",
  "_score": null,
  "_source": {
    "host": "restbase1013",
    "level": "ERROR",
    "version": "1.0",
    "@version": "1",
    "@timestamp": "2016-06-21T14:46:14.750Z",
    "source_host": "10.64.32.80",
    "pid": 1963,
    "message": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
    "res.name": "HTTPError",
    "res.message": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
    "res.status": 500,
    "res.headers.access-control-allow-origin": "*",
    "res.headers.access-control-allow-headers": "accept, content-type",
    "res.headers.access-control-expose-headers": "etag",
    "res.headers.x-xss-protection": "1; mode=block",
    "res.headers.x-content-type-options": "nosniff",
    "res.headers.x-frame-options": "SAMEORIGIN",
    "res.headers.content-security-policy": "default-src 'none'; frame-ancestors 'none'",
    "res.headers.x-content-security-policy": "default-src 'none'; frame-ancestors 'none'",
    "res.headers.x-webkit-csp": "default-src 'none'; frame-ancestors 'none'",
    "res.headers.content-type": "application/json; charset=utf-8",
    "res.headers.vary": "Accept-Encoding",
    "res.headers.date": "Tue, 21 Jun 2016 14:46:14 GMT",
    "res.headers.access-control-allow-methods": "GET",
    "res.headers.cache-control": "private, max-age=0, s-maxage=0, must-revalidate",
    "res.headers.x-request-id": "5fd74013-0696-44a2-892b-43fd113d486d",
    "res.body.status": 500,
    "res.body.type": "https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
    "res.body.title": "Error in Cassandra table storage backend",
    "res.body.detail": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
    "res.body.method": "get",
    "res.body.uri": "/www.wikidata.org/v1/page/html/Q24510758",
    "res.requestName": "from_backend",
    "stack": "HTTPError: 500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error\n    at Request.P.try.bind.then.err.retry.HTTPError.status (/srv/deployment/restbase/deploy/node_modules/hyperswitch/node_modules/preq/index.js:219:23)\n    at Request.tryCatcher (/srv/deployment/restbase/deploy/node_modules/bluebird/js/release/util.js:16:23)\n    at Promise._settlePromiseFromHandler (/srv/deployment/restbase/deploy/node_modules/bluebird/js/release/promise.js:502:31)\n    at Promise._settlePromise (/srv/deployment/restbase/deploy/node_modules/bluebird/js/release/promise.js:559:18)\n    at Promise._settlePromise0 (/srv/deployment/restbase/deploy/node_modules/bluebird/js/release/promise.js:604:10)\n    at Promise._settlePromises (/srv/deployment/restbase/deploy/node_modules/bluebird/js/release/promise.js:683:18)\n    at Promise._fulfill (/srv/deployment/restbase/deploy/node_modules/bluebird/js/release/promise.js:628:18)\n    at Request._callback (/srv/deployment/restbase/deploy/node_modules/bluebird/js/release/nodeback.js:45:21)\n    at Request.self.callback (/srv/deployment/restbase/deploy/node_modules/hyperswitch/node_modules/preq/node_modules/request/request.js:200:22)\n    at emitTwo (events.js:87:13)\n    at Request.emit (events.js:172:7)\n    at Request.<anonymous> (/srv/deployment/restbase/deploy/node_modules/hyperswitch/node_modules/preq/node_modules/request/request.js:1067:10)\n    at emitOne (events.js:82:20)\n    at Request.emit (events.js:169:7)\n    at IncomingMessage.<anonymous> (/srv/deployment/restbase/deploy/node_modules/hyperswitch/node_modules/preq/node_modules/request/request.js:988:12)\n    at emitNone (events.js:72:20)\n    at IncomingMessage.emit (events.js:166:7)",
    "latency": 5068,
    "levelPath": "error/request",
    "root_req.method": "get",
    "root_req.uri": "/www.wikidata.org/v1/page/mobile-sections/Q24510758?redirect=false",
    "root_req.headers.cache-control": "no-cache",
    "root_req.headers.content-length": "0",
    "root_req.headers.user-agent": "ChangePropagation/WMF",
    "root_req.headers.x-client-ip": "::ffff:10.64.0.16",
    "root_req.headers.x-forwarded-for": "::ffff:10.64.0.16",
    "root_req.headers.x-request-id": "5fd74013-0696-44a2-892b-43fd113d486d",
    "root_req.headers.x-request-class": "internal_update",
    "root_req.headers.x-triggered-by": "resource_change:http://www.wikidata.org/api/rest_v1/page/html/Q24510758",
    "request_id": "5fd74013-0696-44a2-892b-43fd113d486d",
    "type": "restbase",
    "tags": [
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "normalized_message": "500: https://mediawiki.org/wiki/HyperSwitch/errors/query_error",
    "gelf_level": "3"
  },
  "sort": [
    1466520374750
  ]
}
{
  "_index": "logstash-2016.06.21",
  "_type": "restbase",
  "_id": "AVVzbQ_ww3dCNxx2ZHRf",
  "_score": null,
  "_source": {
    "host": "restbase1011",
    "level": "ERROR",
    "version": "1.0",
    "@version": "1",
    "@timestamp": "2016-06-21T14:46:14.693Z",
    "source_host": "10.64.0.113",
    "pid": 1875,
    "message": "500: query_error",
    "res.name": "HTTPError",
    "res.message": "500: query_error",
    "res.status": 500,
    "res.body.type": "query_error",
    "res.body.title": "Error in Cassandra table storage backend",
    "res.body.stack": "Error: Operation timed out - received only 1 responses.\n    at ResponseError.DriverError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:14:19)\n    at new ResponseError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:51:24)\n    at FrameReader.readError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/readers.js:309:13)\n    at Parser.parseBody (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:201:66)\n    at Parser._transform (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:113:10)\n    at Parser.Transform._read (_stream_transform.js:167:10)\n    at Parser.Transform._write (_stream_transform.js:155:12)\n    at doWrite (_stream_writable.js:292:12)\n    at writeOrBuffer (_stream_writable.js:278:5)\n    at Parser.Writable.write (_stream_writable.js:207:11)\n    at Protocol.ondata (_stream_readable.js:528:20)\n    at emitOne (events.js:77:13)\n    at Protocol.emit (events.js:169:7)\n    at readableAddChunk (_stream_readable.js:146:16)\n    at Protocol.Readable.push (_stream_readable.js:110:10)\n    at Protocol.Transform.push (_stream_transform.js:128:32)",
    "res.body.err.name": "ResponseError",
    "res.body.err.stack": "Error: Operation timed out - received only 1 responses.\n    at ResponseError.DriverError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:14:19)\n    at new ResponseError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:51:24)\n    at FrameReader.readError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/readers.js:309:13)\n    at Parser.parseBody (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:201:66)\n    at Parser._transform (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:113:10)\n    at Parser.Transform._read (_stream_transform.js:167:10)\n    at Parser.Transform._write (_stream_transform.js:155:12)\n    at doWrite (_stream_writable.js:292:12)\n    at writeOrBuffer (_stream_writable.js:278:5)\n    at Parser.Writable.write (_stream_writable.js:207:11)\n    at Protocol.ondata (_stream_readable.js:528:20)\n    at emitOne (events.js:77:13)\n    at Protocol.emit (events.js:169:7)\n    at readableAddChunk (_stream_readable.js:146:16)\n    at Protocol.Readable.push (_stream_readable.js:110:10)\n    at Protocol.Transform.push (_stream_transform.js:128:32)",
    "res.body.err.message": "Operation timed out - received only 1 responses.",
    "res.body.err.info": "Represents an error message from the server",
    "res.body.err.code": 4608,
    "res.body.err.consistencies": 6,
    "res.body.err.received": 1,
    "res.body.err.blockFor": 2,
    "res.body.err.isDataPresent": 1,
    "res.body.err.coordinator": "10.64.0.232:9042",
    "res.body.err.query": "select \"title\",\"page_id\",\"rev\",\"latest_rev\",\"tid\",\"namespace\",\"restrictions\",\"tags\",\"renames\",\"nextrev_tid\",\"latest_tid\",\"user_id\",\"user_text\",\"timestamp\",\"comment\",\"redirect\",\"_domain\",\"_del\" from \"local_group_default_T_title__revisions\".\"data\" where \"title\" = ? AND \"_domain\" = ?",
    "res.body.req.uri.path.0": "www.wikidata.org",
    "res.body.req.uri.path.1": "sys",
    "res.body.req.uri.path.2": "table",
    "res.body.req.uri.path.3": "title_revisions",
    "res.body.req.uri.path.4": "",
    "res.body.req.body.table": "title_revisions",
    "res.body.req.body.attributes.title": "Q24510758",
    "res.body.req.body.attributes._domain": "www.wikidata.org",
    "res.body.req.body.limit": 1,
    "res.body.req.method": "get",
    "res.body.req.params.domain": "www.wikidata.org",
    "res.body.req.params.api": "sys",
    "res.body.req.params.table": "title_revisions",
    "res.innerBody.type": "query_error",
    "res.innerBody.title": "Error in Cassandra table storage backend",
    "res.innerBody.stack": "Error: Operation timed out - received only 1 responses.\n    at ResponseError.DriverError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:14:19)\n    at new ResponseError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:51:24)\n    at FrameReader.readError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/readers.js:309:13)\n    at Parser.parseBody (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:201:66)\n    at Parser._transform (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:113:10)\n    at Parser.Transform._read (_stream_transform.js:167:10)\n    at Parser.Transform._write (_stream_transform.js:155:12)\n    at doWrite (_stream_writable.js:292:12)\n    at writeOrBuffer (_stream_writable.js:278:5)\n    at Parser.Writable.write (_stream_writable.js:207:11)\n    at Protocol.ondata (_stream_readable.js:528:20)\n    at emitOne (events.js:77:13)\n    at Protocol.emit (events.js:169:7)\n    at readableAddChunk (_stream_readable.js:146:16)\n    at Protocol.Readable.push (_stream_readable.js:110:10)\n    at Protocol.Transform.push (_stream_transform.js:128:32)",
    "res.innerBody.err.name": "ResponseError",
    "res.innerBody.err.stack": "Error: Operation timed out - received only 1 responses.\n    at ResponseError.DriverError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:14:19)\n    at new ResponseError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:51:24)\n    at FrameReader.readError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/readers.js:309:13)\n    at Parser.parseBody (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:201:66)\n    at Parser._transform (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:113:10)\n    at Parser.Transform._read (_stream_transform.js:167:10)\n    at Parser.Transform._write (_stream_transform.js:155:12)\n    at doWrite (_stream_writable.js:292:12)\n    at writeOrBuffer (_stream_writable.js:278:5)\n    at Parser.Writable.write (_stream_writable.js:207:11)\n    at Protocol.ondata (_stream_readable.js:528:20)\n    at emitOne (events.js:77:13)\n    at Protocol.emit (events.js:169:7)\n    at readableAddChunk (_stream_readable.js:146:16)\n    at Protocol.Readable.push (_stream_readable.js:110:10)\n    at Protocol.Transform.push (_stream_transform.js:128:32)",
    "res.innerBody.err.message": "Operation timed out - received only 1 responses.",
    "res.innerBody.err.info": "Represents an error message from the server",
    "res.innerBody.err.code": 4608,
    "res.innerBody.err.consistencies": 6,
    "res.innerBody.err.received": 1,
    "res.innerBody.err.blockFor": 2,
    "res.innerBody.err.isDataPresent": 1,
    "res.innerBody.err.coordinator": "10.64.0.232:9042",
    "res.innerBody.err.query": "select \"title\",\"page_id\",\"rev\",\"latest_rev\",\"tid\",\"namespace\",\"restrictions\",\"tags\",\"renames\",\"nextrev_tid\",\"latest_tid\",\"user_id\",\"user_text\",\"timestamp\",\"comment\",\"redirect\",\"_domain\",\"_del\" from \"local_group_default_T_title__revisions\".\"data\" where \"title\" = ? AND \"_domain\" = ?",
    "res.innerBody.req.uri.path.0": "www.wikidata.org",
    "res.innerBody.req.uri.path.1": "sys",
    "res.innerBody.req.uri.path.2": "table",
    "res.innerBody.req.uri.path.3": "title_revisions",
    "res.innerBody.req.uri.path.4": "",
    "res.innerBody.req.body.table": "title_revisions",
    "res.innerBody.req.body.attributes.title": "Q24510758",
    "res.innerBody.req.body.attributes._domain": "www.wikidata.org",
    "res.innerBody.req.body.limit": 1,
    "res.innerBody.req.method": "get",
    "res.innerBody.req.params.domain": "www.wikidata.org",
    "res.innerBody.req.params.api": "sys",
    "res.innerBody.req.params.table": "title_revisions",
    "res.internalReq.uri.path.0": "www.wikidata.org",
    "res.internalReq.uri.path.1": "sys",
    "res.internalReq.uri.path.2": "table",
    "res.internalReq.uri.path.3": "title_revisions",
    "res.internalReq.uri.path.4": "",
    "res.internalReq.body.table": "title_revisions",
    "res.internalReq.body.attributes.title": "Q24510758",
    "res.internalReq.body.attributes._domain": "www.wikidata.org",
    "res.internalReq.body.limit": 1,
    "res.internalReq.method": "get",
    "res.internalReq.params.domain": "www.wikidata.org",
    "res.internalReq.params.api": "sys",
    "res.internalReq.params.table": "title_revisions",
    "res.headers.access-control-allow-origin": "*",
    "res.headers.access-control-allow-methods": "GET",
    "res.headers.access-control-allow-headers": "accept, content-type",
    "res.headers.access-control-expose-headers": "etag",
    "res.headers.cache-control": "private, max-age=0, s-maxage=0, must-revalidate",
    "res.headers.x-content-type-options": "nosniff",
    "res.headers.x-frame-options": "SAMEORIGIN",
    "res.headers.x-xss-protection": "1; mode=block",
    "res.headers.content-security-policy": "default-src 'none'; frame-ancestors 'none'",
    "res.headers.x-content-security-policy": "default-src 'none'; frame-ancestors 'none'",
    "res.headers.x-webkit-csp": "default-src 'none'; frame-ancestors 'none'",
    "res.headers.x-request-id": "5fd74013-0696-44a2-892b-43fd113d486d",
    "stack": "Error: Operation timed out - received only 1 responses.\n    at ResponseError.DriverError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:14:19)\n    at new ResponseError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/errors.js:51:24)\n    at FrameReader.readError (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/readers.js:309:13)\n    at Parser.parseBody (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:201:66)\n    at Parser._transform (/srv/deployment/restbase/deploy/node_modules/restbase-mod-table-cassandra/node_modules/cassandra-driver/lib/streams.js:113:10)\n    at Parser.Transform._read (_stream_transform.js:167:10)\n    at Parser.Transform._write (_stream_transform.js:155:12)\n    at doWrite (_stream_writable.js:292:12)\n    at writeOrBuffer (_stream_writable.js:278:5)\n    at Parser.Writable.write (_stream_writable.js:207:11)\n    at Protocol.ondata (_stream_readable.js:528:20)\n    at emitOne (events.js:77:13)\n    at Protocol.emit (events.js:169:7)\n    at readableAddChunk (_stream_readable.js:146:16)\n    at Protocol.Readable.push (_stream_readable.js:110:10)\n    at Protocol.Transform.push (_stream_transform.js:128:32)",
    "latency": 5005,
    "levelPath": "error/request",
    "root_req.method": "get",
    "root_req.uri": "/www.wikidata.org/v1/page/summary/Q24510758?redirect=false",
    "root_req.headers.cache-control": "no-cache",
    "root_req.headers.content-length": "0",
    "root_req.headers.user-agent": "ChangePropagation/WMF",
    "root_req.headers.x-client-ip": "::ffff:10.64.16.21",
    "root_req.headers.x-forwarded-for": "::ffff:10.64.16.21",
    "root_req.headers.x-request-id": "5fd74013-0696-44a2-892b-43fd113d486d",
    "root_req.headers.x-request-class": "internal_update",
    "root_req.headers.x-triggered-by": "resource_change:http://www.wikidata.org/api/rest_v1/page/html/Q24510758",
    "request_id": "5fd74013-0696-44a2-892b-43fd113d486d",
    "type": "restbase",
    "tags": [
      "es",
      "gelf",
      "normalized_message_untrimmed"
    ],
    "normalized_message": "500: query_error",
    "gelf_level": "3"
  },
  "sort": [
    1466520374693
  ]
}

which indicates a problem talking to cassandra "res.innerBody.err.message": "Operation timed out - received only 1 responses.", and this lines up with a reboot for kernel upgrade of restbase1007 (mobileapps dashboard: https://grafana.wikimedia.org/dashboard/db/mobileapps?from=1466518372060&to=1466519972060)

Event Timeline

fgiunchedi renamed this task from mobileapps 500s to mobileapps 500s following reboot of restbase1007.Jun 21 2016, 3:04 PM
mobrovac subscribed.

The list of titles that erred out, P3330, shows that various domains were affected, which can explain the errors witnesses right after the reboot (even though the question remains as to why other nodes were not involved in the query). Note, however, that the outage was likely not perceived by clients, as these errors relate to pre-generation update jobs.

MCS logs are sent to logstash, but weirdly enough, these particular log entries never made it there ( cf https://logstash.wikimedia.org/#dashboard/temp/AVW6r4O-c8qLrUhXldEw ). We need to investigate why.

It's worth noting that restbase was depooled during the reboot, so I'm wondering why it still tried to access it?

It's worth noting that restbase was depooled during the reboot, so I'm wondering why it still tried to access it?

The error itself is coming from Cassandra on restbase1007, not RESTBase the service itself. Depooling works only for RB the service.

Mentioned in SAL [2016-07-20T13:28:39Z] <urandom> Restarting restbase1008-a.eqiad.wmnet to apply a (ephemeral) 7200000ms streaming timeout : T138314

Mentioned in SAL [2016-07-20T13:29:59Z] <urandom> Performing rolling RESTBase restart to work-around Cassandra instance restart fallout : T138314 and T138314

Mentioned in SAL [2016-07-20T13:29:59Z] <urandom> Performing rolling RESTBase restart to work-around Cassandra instance restart fallout : T138314 and T138314

GWicke subscribed.

This looks out of date. @mobrovac, time to close it?

mobrovac claimed this task.
mobrovac edited projects, added Services (done); removed Services (attic).

Indeed.