
Implement flag to tell an OCG machine not to take new tasks from the redis task queue
Closed, Resolved · Public

Description

So since ocg1003 was misbehaving lately, I took a look:

the fact that it was neither decommissioned nor reimaged (see T84723) meant it was still processing jobs and, quite surprisingly, receiving requests from the appservers in this form:

GET /?command=download_file&collection_id=75e63cbbd79765efb78d951745923ae3c2a6a5f2&writer=rdf2latex	HTTP/1.0
Host: ocg1003:8000
...

coming in directly from the appservers. Even without accounting for the error in setting the Host header, this quite spectacularly breaks our whole infrastructure model: we assume that all traffic to our individual hosts is mediated by a load balancer, and this clearly violates that.

I seem to understand this is not easy to fix, but it needs to be fixed as soon as possible; otherwise, whenever a host goes down for any reason, we'll fail a portion of the requests.

[CSA] We can implement a flag to tell ocg1003 to stop taking new tasks from the queue. In conjunction with one of the queue purging scripts (or just waiting a few days) that will ensure that ocg1003 becomes quiescent with no cached files, and can be taken down.

Event Timeline

Joe created this task. Dec 2 2015, 11:33 AM
Joe raised the priority of this task to Unbreak Now!.
Joe updated the task description.
Joe added subscribers: Krenair, Matanya, Joe and 8 others.
Joe lowered the priority of this task from Unbreak Now! to High. Dec 2 2015, 3:05 PM
Joe set Security to None.
bd808 added a comment. Dec 2 2015, 5:04 PM

> I seem to understand this is not easy to fix

CommonSettings.php points to the LVS name ocg.svc.eqiad.wmnet:8000

I can't find a reference to ocg1003 directly anywhere in the current production configuration.

What I did find however is that an OCG instance itself can create URLs like this while processing jobs based on this target host configuration:
host = jobDetails.host = config.coordinator.hostname || os.hostname();

It looks to me like there is a config setting (config.coordinator.hostname) that could be pointed at the LVS service name. I don't know enough about the actual code here to know if that would actually work however or if the fallback to os.hostname() has been done purposefully because the generated content is only available locally on the OCG server that generated the output.
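To make the fallback concrete, here is a minimal sketch of how that expression resolves. The helper name `resolveJobHost` and the shape of `config` are assumptions for illustration, not OCG's actual code, and it says nothing about whether pointing the setting at the LVS name would work, given that the rendered file may only exist on one machine.

```javascript
const os = require('os');

// Hypothetical helper mirroring the quoted fallback expression.
// The `config` object shape is assumed, not taken from OCG.
function resolveJobHost(config) {
  const coordinator = (config && config.coordinator) || {};
  return coordinator.hostname || os.hostname();
}

// With the setting present, every job advertises the configured name.
console.log(resolveJobHost({ coordinator: { hostname: 'ocg.svc.eqiad.wmnet' } }));

// Without it, the job advertises whichever machine processed it.
console.log(resolveJobHost({ coordinator: {} }));
```

The second call is what would produce a host-specific name such as ocg1003 in stored job details.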

I don't know too much about OCG's Redis queues, but one possibility to look into is stored jobs in the queue referencing specific hosts. I vaguely remember arguing about that before.

@cscott should know more.

Joe added a comment. Dec 2 2015, 7:04 PM

@bd808 apart from finding the code itself, what is problematic is that this is an essential part of the design of OCG, as I understand it:

  1. The appservers request a rendering from the load balancer, which sends the request to a random backend
  2. The backend enqueues the rendering request in redis
  3. A background worker in every ocg node processes jobs from this queue
  4. The status of the job, once completed, reports the host where the actual pdf is hosted
  5. Any subsequent request for that pdf should go to the specific server that has the content, directly
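The five steps above can be sketched as a small in-memory simulation. Everything here is a stand-in: the arrays and maps replace redis, and the function names are hypothetical rather than OCG's. Only the port and the `download_file` command are taken from the request shown in the task description.

```javascript
// In-memory stand-ins for redis; none of these names come from OCG itself.
const queue = [];            // pending render jobs (step 2)
const jobStatus = new Map(); // collection id -> { state, host } (step 4)

// Steps 1-2: whichever backend the load balancer picks only enqueues the job.
function enqueueRender(collectionId) {
  queue.push(collectionId);
  jobStatus.set(collectionId, { state: 'pending', host: null });
}

// Step 3: a worker on some host pulls the next job and renders it locally.
// Step 4: the finished status records that host, because the file lives there.
function workOnce(hostname) {
  const collectionId = queue.shift();
  if (collectionId === undefined) return null;
  jobStatus.set(collectionId, { state: 'finished', host: hostname });
  return collectionId;
}

// Step 5: a download has to target the recorded host, bypassing the load balancer.
function downloadUrl(collectionId) {
  const status = jobStatus.get(collectionId);
  if (!status || status.state !== 'finished') return null;
  return `http://${status.host}:8000/?command=download_file&collection_id=${collectionId}`;
}

enqueueRender('abc123');
workOnce('ocg1003');
console.log(downloadUrl('abc123'));
```

The point of the sketch is that the host name ends up in stored state, so it outlives any load balancer change until that state is purged.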

A possible solution I see, given what we have available, would be to use some distributed data store as a backend. Since it's just a cache, something fast and not perfectly reliable is OK. Using nutcracker could do it, but AFAIR it only supports in-memory stores. Essentially, if we distribute storage by consistent hashing and use the same hashing when serving the result, we could avoid this clunky system.
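For readers unfamiliar with the idea, a minimal consistent-hashing sketch follows. It is a generic illustration, not a proposal for how nutcracker or OCG would implement it. The host names other than ocg1003, the choice of SHA-1, and the number of virtual points are all assumptions.

```javascript
const crypto = require('crypto');

// Map a string to a 32-bit position on the ring.
function ringPosition(key) {
  return crypto.createHash('sha1').update(key).digest().readUInt32BE(0);
}

// Build a sorted ring with several virtual points per host to even out load.
function buildRing(hosts, pointsPerHost = 64) {
  const ring = [];
  for (const host of hosts) {
    for (let i = 0; i < pointsPerHost; i++) {
      ring.push({ position: ringPosition(`${host}#${i}`), host });
    }
  }
  return ring.sort((a, b) => a.position - b.position);
}

// The owner of a key is the first point at or after its position, wrapping around.
function ownerOf(ring, key) {
  const position = ringPosition(key);
  const point = ring.find((p) => p.position >= position) || ring[0];
  return point.host;
}

// Hypothetical pool: both the writer and the reader compute the same owner.
const fullRing = buildRing(['ocg1001', 'ocg1002', 'ocg1003']);
const reducedRing = buildRing(['ocg1001', 'ocg1002']);
const sampleKey = '75e63cbbd79765efb78d951745923ae3c2a6a5f2';
console.log(ownerOf(fullRing, sampleKey), ownerOf(reducedRing, sampleKey));
```

The useful property is that removing one host only reassigns the keys that host owned; keys owned by the remaining hosts stay where they were.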

In any case, it's a big, big modification of the current behaviour/architecture we're talking about.

So, from my perspective, all that needs to be done here is to have the OCG service check a per-machine status flag (either a local file on that machine, or some sort of boolean stored in redis, or even a line in a puppet-managed config file listing servers) which indicates that that particular server should "no longer accept new jobs". The service can be restarted after that flag is set, and that will cause the server to stop pulling new jobs from the queue (while leaving it up to respond to cache requests).
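As a rough sketch of the local-file variant of that flag: the path `/etc/ocg/decommission` and the names `isDecommissioned` and `pollOnce` are made up for illustration, and `takeJob` stands in for whatever actually pops a job from the redis queue. This version checks the flag on every poll, which is a variation on the restart-based approach described above, not a description of the eventual patch.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical flag location; the real path and mechanism are not given in this thread.
const DEFAULT_FLAG = '/etc/ocg/decommission';

function isDecommissioned(flagPath = DEFAULT_FLAG) {
  return fs.existsSync(flagPath);
}

// One iteration of a worker loop: skip the queue entirely when flagged.
// `takeJob` is a stand-in for the code that pops a job from the redis queue.
function pollOnce(takeJob, flagPath = DEFAULT_FLAG) {
  if (isDecommissioned(flagPath)) return null; // the host can still serve cached files
  return takeJob();
}

// Demo against a temporary directory so nothing real is touched.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'ocg-flag-'));
const flag = path.join(dir, 'decommission');
console.log(pollOnce(() => 'job-1', flag)); // flag absent: the job is taken
fs.writeFileSync(flag, '');
console.log(pollOnce(() => 'job-2', flag)); // flag present: nothing is taken
```

A file on disk has the advantage that puppet can manage it and that it survives a service restart without any redis access.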

The intention was that load balancing was handled by distributing the cache among the servers; that is, it is done when the servers compete to pull jobs from the job queue. This wasn't my architecture originally, but IIRC the idea was that this avoided the need for some global shared file storage. I think this point is orthogonal to the needs of T84723, so if we really want to continue the rearchitecture/load balancing discussion we should probably open a new task for that.

Joe added a comment. Apr 20 2016, 5:53 AM

Adding to the pile of embarrassment that our handling of OCG issues is, OCG did not work properly across datacenters because of this; see T133136.

This also means that, unless we do some hack like giving servers different hostnames in codfw and adding fake DNS entries to both DCs, we can't make OCG work in two datacenters.

Raising this to UBN! again in the vain hope the org will notice.

Joe raised the priority of this task from High to Unbreak Now!. Apr 20 2016, 5:53 AM
Restricted Application added subscribers: TerraCodes, Urbanecm. Apr 20 2016, 5:53 AM
cscott added a comment. Edited Apr 20 2016, 3:06 PM

I commented on T133136 -- we need some design work here it seems. There's not a simple fix, other than purging the OCG cache in redis at the time of the switch. Purging redis *can* in fact be done easily, with no new development, but it wasn't done. So that seems like a failure in process, not a flaw in the code.

Which isn't to say that it wouldn't be nice to better support decommissioning hosts and all. I'm just saying this may be the wrong bug, and I think you are conflating a number of different issues.

Restricted Application added a subscriber: Luke081515. Apr 20 2016, 3:06 PM

Well, it would be nice if it was really "the org". But in reality it's just me. There's no other staffing for OCG, despite requests over a period of years. So it's really better just to talk to me about things, rather than hoping "the org" will notice.

> I commented on T133136 -- we need some design work here it seems. There's not a simple fix, other than purging the OCG cache in redis at the time of the switch. Purging redis *can* in fact be done easily, with no new development, but it wasn't done. So that seems like a failure in process, not a flaw in the code.

Can you document the process for making OCG work when data centers are switched (or work cross data center)? If OCG sticks around and we need to do this again in the future, this information needs to be part of the documented process for data center switching.

Created new task: T133164: Document eqiad/codfw transition plan for OCG. I could use some help from ops -- I don't know where this documentation canonically lives.

Refocusing this bug: I described above adding a per-machine flag to tell specific hosts not to check the redis queue: https://phabricator.wikimedia.org/T120077#2132499

Since no one really objected to (or commented on!) that concrete proposal, I'll try to slap together a patch of that sort and get it deployed on Monday. I'm going to Be Bold and retitle this bug to reflect that patch, which is a solution to the "decommission one host" problem.

The general "datacenter failover" task is probably best done by having a separate redis instance in codfw entirely, so that after the switch there's a new task queue that all the codfw machines are watching, with a clean slate as far as cached files, etc. We can discuss that further in T133164: Document eqiad/codfw transition plan for OCG, keeping this bug on the "take down a single machine" task.

cscott renamed this task from "OCG should not be contacted directly from the appservers but only via LVS" to "Implement flag to tell an OCG machine not to take new tasks from the redis task queue". Apr 20 2016, 4:18 PM
cscott updated the task description.

So... as to implementation. Would ops prefer to decommission a host via puppet or redis? The puppet option would be to have puppet create a specific "shutdown" file on the targeted host's filesystem. The redis option is running some redis command on the command line to create entries for a specific host or hosts. It seems puppet (aka local file) is best? The puppet script could probably even run the appropriate command to purge redis of entries pertaining to a specific host.

If you've got other ways you'd prefer doing this, let me know.

Change 284601 had a related patch set uploaded (by Cscott):
Allow decommissioning OCG hosts.

https://gerrit.wikimedia.org/r/284601

Change 284601 merged by jenkins-bot:
Allow decommissioning OCG hosts.

https://gerrit.wikimedia.org/r/284601

Restricted Application added a subscriber: Southparkfan. Apr 26 2016, 9:48 PM
Joe added a comment. Apr 27 2016, 8:22 AM

@cscott so the way to depool a server will be:

  1. Remove it from the pool in the load balancer
  2. put that flag on the local filesystem
  3. Clean redis anyways so that mw* hosts stop contacting that specific host directly

If you consider that for every other piece of software we run, depooling consists of point 1 alone, it shows that OCG does need a more profound restructuring.

btw, point 3 relies on a broken script afaik.

Joe added a comment. Edited Apr 27 2016, 8:26 AM

Also, I don't really understand why you mock my saying that the WMF should take care of this collectively by continuing to double-quote "the org".

If it's just you, we should simply turn this off, as it's not going to be maintainable.

I believe step 1 and step 2 can be performed in a single puppet commit. Step 3 can probably be integrated, but I'd like to go through the process with ocg1003 first and make sure we've got the process down.

Joe added a comment. Apr 28 2016, 6:30 AM

Step 1 is not a puppet commit anymore; I guess even the flag to put on the FS could be done outside of puppet.

Thinking more about the mechanism: to give users near-zero downtime, we should perform the operations in the following order:

  • Set the FS flag so we immediately stop processing new jobs from the queue
  • Remove the server from the LB, so that no new processing requests are served from that machine
  • Run the cleanup script (which I plan to test now that HSCAN is supported by our servers)

If the script works, we'll have a way to properly depool a server for maintenance, even if it requires special treatment.
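To illustrate what an HSCAN-based cleanup step might look like, here is a sketch that walks a hash with a cursor and removes entries belonging to one host. It makes several assumptions: that job status lives in a single redis hash, that each value is JSON with a `host` field, and that the key is called `ocg_job_status`. None of that is confirmed in this thread. The client is an in-memory fake that only mimics the HSCAN reply shape (`[nextCursor, [field, value, ...]]`, with cursor `'0'` meaning the scan is complete); a real client would be asynchronous.

```javascript
// Fake client holding one hash in memory. Its hscan mimics the Redis reply shape:
// [nextCursor, [field1, value1, field2, value2, ...]]; cursor '0' means "done".
function makeFakeClient(entries, pageSize = 2) {
  const hash = new Map(Object.entries(entries));
  return {
    hash,
    hscan(_key, cursor) {
      const all = [...hash.entries()];
      const start = Number(cursor);
      const end = start + pageSize;
      const page = all.slice(start, end).flat();
      return [end >= all.length ? '0' : String(end), page];
    },
    hdel(_key, field) {
      return hash.delete(field) ? 1 : 0;
    },
  };
}

// Scan the whole hash first, then delete, so deletions cannot disturb the scan.
// Assumes each value is JSON with a `host` field (not confirmed by the thread).
function purgeHost(client, key, hostname) {
  const stale = [];
  let cursor = '0';
  do {
    const [next, flat] = client.hscan(key, cursor);
    for (let i = 0; i < flat.length; i += 2) {
      if (JSON.parse(flat[i + 1]).host === hostname) stale.push(flat[i]);
    }
    cursor = next;
  } while (cursor !== '0');
  for (const field of stale) client.hdel(key, field);
  return stale.length;
}

const demoClient = makeFakeClient({
  a: JSON.stringify({ host: 'ocg1003' }),
  b: JSON.stringify({ host: 'ocg1001' }),
  c: JSON.stringify({ host: 'ocg1003' }),
});
console.log(`removed ${purgeHost(demoClient, 'ocg_job_status', 'ocg1003')} entries`);
```

Iterating with a cursor avoids fetching the entire hash in one blocking call, which is the usual reason to prefer HSCAN over HGETALL on a large hash.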

The need for a restructuring will still be there, but this would make for an acceptable stopgap solution.

And the cleanup script can be integrated (once we test & validate it) so that a machine which finds itself on the 'decommissioned' list can immediately begin emptying its cache. So that's just a two step process.

For transition between eqiad and codfw, one option is: first add the codfw machines to the load balancer, then add the eqiad machines to the decommission list (which will also have them start emptying their caches), then once the caches are empty remove eqiad from the load balancer. But that requires routing between the codfw and eqiad clusters to support the transition period where requests are being served out of both caches.

An alternative is to separate the redis instances for codfw and eqiad, and switch the load balancer in one shot without any mucking around with decommissioning. That's effectively starting with a cold cache in codfw, but that's not terrible with our current low hit rates. It would interrupt any jobs in the process of being built at the moment the switchover took place, which is non-ideal. But it wouldn't require cross-cluster routing.

I attempted a puppet patch to use this functionality in https://gerrit.wikimedia.org/r/286070

Please take a look and let me know if I'm doing things right -- completely untested right now.
Maybe we want to try decommissioning one of the beta/labs OCG machines (deployment-pdf02, say) before letting puppet put this on the production OCG cluster?

cscott closed this task as Resolved. May 5 2016, 6:00 PM
cscott claimed this task.

Ok, we deployed the config change and tested everything and confirmed that it works as intended to place the server in "decommission mode", where it no longer starts backend tasks.

Resolving this task as closed. We didn't successfully empty out the cache for ocg1003 due to T120079: The OCG cleanup cache script doesn't work properly, but we'll fix that as a separate task.

Restricted Application added a subscriber: Jay8g. Oct 16 2016, 1:05 PM