Kademlia rate limiter failing unexpectedly
Closed, Resolved (Public)

Description

The Kademlia rate limiter has started failing. Because the failure occurs in the master process, it brings down RESTBase. The error trace is:

TypeError: Cannot read property 'publisher' of null
    at Readable.<anonymous> (/srv/deployment/restbase/deploy-cache/revs/55fcd4be9d0c72ddf4b826813a6354249a5c3f64/node_modules/kad/lib/node.js:239:13)
    at emitOne (events.js:96:13)
    at Readable.emit (events.js:188:7)
    at addChunk (/srv/deployment/restbase/deploy-cache/revs/55fcd4be9d0c72ddf4b826813a6354249a5c3f64/node_modules/limitation/node_modules/readable-stream/lib/_stream_readable.js:291:12)
    at readableAddChunk (/srv/deployment/restbase/deploy-cache/revs/55fcd4be9d0c72ddf4b826813a6354249a5c3f64/node_modules/limitation/node_modules/readable-stream/lib/_stream_readable.js:278:11)
    at Readable.push (/srv/deployment/restbase/deploy-cache/revs/55fcd4be9d0c72ddf4b826813a6354249a5c3f64/node_modules/limitation/node_modules/readable-stream/lib/_stream_readable.js:245:10)
    at Immediate.pushItem (/srv/deployment/restbase/deploy-cache/revs/55fcd4be9d0c72ddf4b826813a6354249a5c3f64/node_modules/limitation/lib/decaying_counter_store.js:130:24)
    at runCallback (timers.js:672:20)
    at tryOnImmediate (timers.js:645:5)
    at processImmediate [as _immediateCallback] (timers.js:617:5)

It started on 2018-11-26T00:06:59 and has happened a total of 74 times as of 2018-12-26T18:16:00.
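
For illustration only, here is a minimal sketch of the failure class in the trace above: a 'data' listener dereferences a property of a value that turns out to be null, and the exception propagates out of emit() and ends the process. The entry shape ({ key, value: { publisher } }) and the helpers extractPublisher and guardListener are hypothetical names made up for this example. They are not kad's or limitation's actual code, and this is not the patch that was merged for this task.

```javascript
'use strict';

const { EventEmitter } = require('events');

// Hypothetical entry shape: { key, value: { publisher } }. The real structure
// used by kad is not shown in this task, so these field names are assumptions.
function extractPublisher(entry) {
  // Guard each level before dereferencing; a null here would otherwise throw
  // "TypeError: Cannot read property 'publisher' of null".
  if (!entry || !entry.value) {
    return undefined;
  }
  return entry.value.publisher;
}

// Wrap a listener so that an exception is handed to onError instead of
// propagating out of emit() and terminating the process.
function guardListener(listener, onError) {
  return (...args) => {
    try {
      return listener(...args);
    } catch (err) {
      onError(err);
      return undefined;
    }
  };
}

// Stand-in for the stream in the trace; emit() calls listeners synchronously.
const source = new EventEmitter();
const publishers = [];
const errors = [];

source.on('data', guardListener((entry) => {
  const publisher = extractPublisher(entry);
  if (publisher !== undefined) {
    publishers.push(publisher);
  }
}, (err) => errors.push(err)));

source.emit('data', { key: 'a', value: { publisher: 'node-1' } });
source.emit('data', { key: 'b', value: null }); // skipped, no exception
source.emit('data', null);                      // skipped, no exception
```

The null check addresses the specific TypeError, while the wrapper is a broader safety net so that a single malformed item cannot take down the master process. Either one alone would likely have avoided the crash in a case like this.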

Event Timeline

mobrovac created this task.

Ok, I tracked down the problem to the kad library. In order to have the fix in place, the following patches need to be merged (in this order):

Obviously, a more sustainable solution needs to be found; all of the above is just a temporary fix.

This is just a workaround, and we want to replace the fork with upstream ASAP, but it is good enough to stop the bleeding. I approve this message!

Change 481491 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[mediawiki/services/cxserver@master] Bump service-runner version to v2.6.9

https://gerrit.wikimedia.org/r/481491

Mentioned in SAL (#wikimedia-operations) [2018-12-27T17:15:54Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@ae7a537]: Fix rate-limiter crash - T212631 - deploy only on canary restbase1007

Mentioned in SAL (#wikimedia-operations) [2018-12-27T17:20:18Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@ae7a537]: Fix rate-limiter crash - T212631 - deploy only on canary restbase1007 (duration: 04m 24s)

Mentioned in SAL (#wikimedia-operations) [2018-12-27T17:50:18Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@70c4752]: Fix rate-limiter crash - T212631

Mentioned in SAL (#wikimedia-operations) [2018-12-27T18:03:28Z] <mobrovac@deploy1001> deploy aborted: Fix rate-limiter crash - T212631 (duration: 13m 09s)

Mentioned in SAL (#wikimedia-operations) [2018-12-27T18:04:06Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays) - T212631

Mentioned in SAL (#wikimedia-operations) [2018-12-27T18:04:15Z] <mobrovac@deploy1001> deploy aborted: Fix rate-limiter crash (with increased deploy delays) - T212631 (duration: 00m 09s)

Mentioned in SAL (#wikimedia-operations) [2018-12-27T18:04:24Z] <mobrovac@deploy1001> Started deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays) - T212631

Mentioned in SAL (#wikimedia-operations) [2018-12-27T18:15:44Z] <mobrovac@deploy1001> Finished deploy [restbase/deploy@a67f38e]: Fix rate-limiter crash (with increased deploy delays) - T212631 (duration: 11m 20s)

Change 481491 merged by jenkins-bot:
[mediawiki/services/cxserver@master] Bump service-runner version to v2.6.9

https://gerrit.wikimedia.org/r/481491