Stop relying on Redis LUA scripting for jobrunner
Closed, ResolvedPublic

Description

I think that last weekend's outage is a testament to how dangerous using this feature is. There are several arguments against using Lua scripting in Redis:

  • It's a CPU black hole and a monitoring nightmare. One's datastore suddenly munching on CPU because of a stored procedure isn't something any Ops person wants.
  • It can bring down the data store and/or the site easily.
  • It's inefficient, with the script having to be pushed every time.
  • Resorting to Lua scripting is a sign that Redis isn't the right tool for the task at hand. If that scripting capability weren't there, a solution using another tool would have to be implemented, and that is what we should be doing.

What is the Lua scripting currently being used for in the jobrunner? Could it be reimplemented in PHP? Is it a sign that Redis isn't the best queue provider and that we should be using something else?

Event Timeline

Gilles raised the priority of this task from to High.
Gilles updated the task description. (Show Details)
Gilles added subscribers: Gilles, aaron, ori, chasemp.

It's inefficient, with the script having to be pushed every time.

No; that's what EVALSHA is for.
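To illustrate the point: a client computes the SHA-1 of the script once and thereafter invokes it by digest, falling back to pushing the full body only when the server replies that the script isn't cached (e.g. after a restart or SCRIPT FLUSH). The sketch below is a minimal illustration of that pattern using a hypothetical in-memory `FakeRedis` stand-in, not the actual jobrunner code; real clients such as redis-py expose the same `script_load`/`evalsha`/`eval` operations.

```python
import hashlib

class NoScriptError(Exception):
    """Stands in for the NOSCRIPT error a real Redis server returns."""

class FakeRedis:
    """Hypothetical stand-in for a Redis client, for illustration only."""
    def __init__(self):
        self._scripts = {}  # sha1 hex digest -> script body

    def script_load(self, script):
        # SCRIPT LOAD: cache the script server-side, return its SHA-1 digest.
        sha = hashlib.sha1(script.encode()).hexdigest()
        self._scripts[sha] = script
        return sha

    def evalsha(self, sha, *args):
        # EVALSHA: run a previously cached script by digest.
        if sha not in self._scripts:
            raise NoScriptError(sha)
        return f"ran:{sha[:8]}"  # placeholder for real script execution

    def eval(self, script, *args):
        # EVAL: push the full body; Redis also caches it as a side effect.
        sha = self.script_load(script)
        return self.evalsha(sha, *args)

def run_script(client, script, *args):
    """Invoke by digest; push the full script body only on a cache miss."""
    sha = hashlib.sha1(script.encode()).hexdigest()
    try:
        return client.evalsha(sha, *args)
    except NoScriptError:
        return client.eval(script, *args)
```

So in steady state only the 40-byte digest crosses the wire per invocation; the script body is transferred once per server lifetime, which is why "pushed every time" overstates the cost.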

Other large shops using Redis have concluded that "LUA is not production ready in Redis today".

The context for the quote makes it clear that they are talking about exposing Lua scripting as a generic facility on Redis clusters that have multiple applications as tenant clients. In that kind of setup, Lua scripting is a non-starter because of the ease with which a single developer error can cause a failure to cascade to multiple services. Our setup is different: we use Lua only on the job queue redises, and discourage the use of Lua elsewhere.

I'm not a fan of our use of Lua either, but I think the incident documentation demonizes it in a naive and unrigorous fashion.

I think the job queue needs to be overhauled pretty substantially. It has some major problems, like the lack of a fair and efficient job scheduling mechanism. But it's a lot of work, it's pretty remote from performance, and the value to users would be small. In my opinion, a better bet would be to improve monitoring but leave the current architecture intact until we are ready to commit the engineering resources a proper solution to this problem domain would require.

aaron claimed this task.

The job queue has long since been migrated to changeprop.