
The OCG cleanup cache script doesn't work properly
Closed, ResolvedPublic

Description

I tried to run the script, but it hung multiple times while attempting the incredibly inefficient operation of scanning all the OCG jobs in ocg_job_status on redis.

So it doesn't work. I would suggest that we keep a per-server list of cached items in redis, in addition to what we have in ocg_job_status, so that it's easy to know which items we need to purge and which we don't.

This is extremely important, as we currently have no way to depool a server.
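
To illustrate the proposal, here is a minimal sketch of such a per-server index, assuming an ioredis client and made-up key names (ocg_cached_items:<host>); it is not OCG's actual code or schema:

```js
// Minimal sketch only: keep a per-server index of cached items next to
// ocg_job_status, so purging a host reads one set instead of scanning
// every job. Client (ioredis) and key names are assumptions, not OCG's.
const Redis = require('ioredis');
const redis = new Redis(); // defaults to 127.0.0.1:6379

// Writer side: when a render finishes and its result is cached on this host.
async function recordCachedItem(hostname, jobId) {
  await redis.sadd('ocg_cached_items:' + hostname, jobId);
}

// Purge side: when depooling a host, touch only the jobs it actually cached.
async function purgeHost(hostname) {
  const indexKey = 'ocg_cached_items:' + hostname;
  const jobIds = await redis.smembers(indexKey);
  for (const jobId of jobIds) {
    await redis.hdel('ocg_job_status', jobId); // drop the job's status entry
  }
  await redis.del(indexKey); // finally drop the per-host index itself
}
```

The writer side adds one SADD per completed job; the purge side then only touches that host's entries instead of walking the whole ocg_job_status hash.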

Event Timeline

Joe created this task. Dec 2 2015, 11:45 AM
Joe raised the priority of this task to Unbreak Now!.
Joe updated the task description.
Joe updated the task description.
Joe set Security to None.
Joe added subscribers: Krenair, Matanya, Joe and 8 others.

Any idea who should be assigned to this / whose responsibility this might be?

ori added a subscriber: ori. Dec 11 2015, 8:26 AM

If we moved ocg to one of the Jessie boxes, which have redis 2.8, we'd be able to iterate on keys using [[ http://redis.io/commands/hscan | HSCAN ]].
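
For illustration, a cursor-based walk over ocg_job_status with HSCAN could look roughly like this (the ioredis client and the COUNT hint are assumptions, not what OCG actually runs):

```js
// Rough sketch of cursor-based iteration over ocg_job_status with HSCAN
// (redis >= 2.8), avoiding pulling the whole hash in one shot.
const Redis = require('ioredis');
const redis = new Redis();

async function scanJobStatus(handleEntry) {
  let cursor = '0';
  do {
    // HSCAN replies with [nextCursor, [field1, value1, field2, value2, ...]]
    const [next, kv] = await redis.hscan('ocg_job_status', cursor, 'COUNT', 100);
    for (let i = 0; i < kv.length; i += 2) {
      handleEntry(kv[i], kv[i + 1]); // jobId, serialized status object
    }
    cursor = next;
  } while (cursor !== '0');
}

// Example: walk all entries without ever holding the whole hash in memory.
scanJobStatus(() => {}).then(() => console.log('scan complete'));
```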

Joe added a comment. Dec 11 2015, 8:27 AM

@Aklapper apparently no one wants to take responsibility for this, and that is of course a problem.

Ottomata added a subscriber: Ottomata.

I'm not sure who to poke about this, so I'm adding it to the Scrum-of-Scrums board. Please find a taker in that meeting.

ssastry added a subscriber: ssastry. Edited Dec 14 2015, 9:00 PM

I'm going to ping @cscott about it tomorrow at the parsing team meeting. He will also likely be showing up for the Scrum of Scrums meeting on Wednesday.

I'll take a look at this. My recollection is that I found some fairly fundamental problems with the way servers are depooled (T105372). I don't think this is really an "unbreak now" priority task, though -- as mentioned in T84723, just `service ocg stop` is sufficient for maintenance tasks like T84723 (if not ideal).

Joe added a comment. Dec 16 2015, 8:26 AM

@cscott: No it's not. All clients (as in real users of Wikipedia and the content extension) requesting a document that is supposed to be in the depooled server's cache will be presented with an error instead of the document.

@cscott, any news on this?

Gehel added a subscriber: Gehel. Feb 10 2016, 2:46 PM

> @Aklapper apparently no one wants to take responsibility for this, and that is of course a problem.

Indeed. @cscott, your comment T120079#1881094 implies you feel ownership of this; do you feel ownership and authority over priority setting for OCG?

Joe added a comment. Mar 17 2016, 8:09 AM

This task might not be "UBN!" in the minds of some of us, but it has lain untouched since December and, for the record, I won't fix any problem arising from ocg other than by "restarting the service", since this is clearly abandonware at the moment.

I see this as a collective failure we should take into account when/if we move progressively to a more microservice-oriented architecture.

We're doing a horrible service to our community, and no one seems to care.

I'm adding TechCom-RFC to put this on TechCom's managed backlog.

@Joe not sure this is a "horrible service to our community" when the service is running just fine and doing what the community wants it to.

Yes, OCG has been "abandonware" for a long time now. It has < 10% of a single developer's time. I have attempted to get more community maintenance of the codebase, without much success. But at the Wikimania 2016 hackathon the Kiwix folks are proposing to do serious work on OCG, as they would like to add a ZIM backend for Kiwix. Hopefully this will lead to greater involvement of the Kiwix community in OCG.

Unfortunately, I didn't implement the basic service architecture of OCG. The shutdown script mentioned in the header does work -- I've used it in the past -- even though it is slow. Help/patches to improve it are certainly appreciated, of course. I believe we've upgraded redis since this bug was filed, so we should now automagically be using the more efficient HSCAN operation as well, which should help.

Technical discussion:
I think the actual blocking bug for T84723 is not this one -- the provided script works, or at least could be made to work. The blocking bug is actually T120077: currently, as long as the service is up (to respond to cache requests), it is also pulling new jobs from the queue and servicing them, continuing to refill its queue. That has a reasonably straightforward solution (discussion in T120077). Once that is fixed, you can just wait for the cache entries to expire naturally in a few days (3, I think) and then decommission the server -- even without the host decommissioning script (which I still contend is not actually broken).

cscott lowered the priority of this task from Unbreak Now! to High. Mar 17 2016, 10:20 PM
cscott added a comment. May 5 2016, 5:58 PM

@Joe and @cscott ran the script today, after deploying the T120077 change to decommission ocg1003.

The result was that the script hung after processing approximately 3276 entries (out of ~400,000) in the ocg_job_status hash on redis:

(01:33:13 PM) _joe_: cscott: and it's using hscan
(01:33:32 PM) _joe_: hscans are very fast
(01:34:32 PM) _joe_: ok, what I see is that after a short while those error messages are spawned and all talking with redis stops
(01:12:42 PM) _joe_: cscott: [2016-05-05T17:12:22.371Z] ERROR: mw-ocg-service/46865 on ocg1003: (node) warning: Recursive process.nextTick detected. This will break in the next version of node. Please use setImmediate for recursive deferral.
(01:45:13 PM) _joe_: cscott: so, a brief peek at that packet capture; last thing it does is hscan ocg_job_status 3276 
(01:45:33 PM) _joe_: which returns instantly when executed on the redis server
(01:45:40 PM) _joe_: so it's definitely not waiting for redis
(01:46:47 PM) _joe_: cscott: not a single HDEL was done in this run of the script
(01:47:25 PM) cscott-free: _joe_: there are about 400k status objects, so it's not getting far through.  makes sense that it's cleaned out all job entries in the first 3276 status objects, though, so it's not doing any more hdels.

The problem seems to be node 0.10's "recursive nextTick" prohibition:

(01:49:35 PM) cscott-free: it seems like there's actually a hardcoded limit in node 0.10, so the warning isn't actually harmless, it's dropping some tasks on the floor once we hit the limit.
(01:49:40 PM) cscott-free: that's why the job is getting stuck.
(01:49:56 PM) cscott-free: solution: upgrade node (hard) or upgrade some dependencies (probably easier)
(01:50:16 PM) cscott-free: the node 0.10.x series was not well-loved.
(01:51:07 PM) cscott-free: (looking at https://graphite.wikimedia.org/ for the ocg.pdf.status_objects key, you can see that the cache script did successfully knock out a few of the status entries from redis, just not nearly enough of them)
(01:47:52 PM) _joe_: cscott: I guess that script needs to be worked on; nonetheless I think writing a simple cache wiping script should be doable 
(01:48:08 PM) _joe_: even on my side, in python, if needed
(01:47:49 PM) cscott-free: _joe_: no worries.  i did some digging on the nexttick thing, i might be able to work around it for node 0.10 by upgrading some dependencies.
(01:48:09 PM) cscott-free: _joe_: so i'll do that, and add some better feedback to the script so we can tell on console when it's working vs stuck.
(01:48:16 PM) _joe_: ok cool
(01:48:41 PM) cscott-free: _joe_: probably deploy on monday?  although by that time most of the entries should have expired from the cache naturally.
(01:48:53 PM) cscott-free: _joe_: so early next week we should be able to actually take down ocg1003 regardless.
(01:49:00 PM) _joe_: ok :)

(IRC log edited and reordered a bit.)

I plan to update OCG to the latest version of prfun+babybird, which use setImmediate and avoid process.nextTick entirely. That should fix the issue for node 0.10.
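
To make the distinction concrete, here is a minimal illustration of the deferral pattern in question (not OCG code, just the general shape): the nextTick variant is the kind of recursion node 0.10 warns about, while the setImmediate variant yields to the event loop between steps so pending redis replies keep flowing.

```js
// Illustration of the deferral pattern only -- not OCG code. Re-scheduling
// each step with process.nextTick keeps node 0.10 from returning to the
// event loop, which is what triggers the "Recursive process.nextTick
// detected" warning seen above; setImmediate yields between steps so
// pending I/O (e.g. redis replies) still gets handled.

function countDownNextTick(n, done) {
  if (n === 0) { return done(); }
  process.nextTick(function () {   // problematic: recursive nextTick deferral
    countDownNextTick(n - 1, done);
  });
}

function countDownSetImmediate(n, done) {
  if (n === 0) { return done(); }
  setImmediate(function () {       // yields to the event loop on every step
    countDownSetImmediate(n - 1, done);
  });
}

countDownSetImmediate(500000, function () { console.log('done'); });
```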

Mentioned in SAL [2016-05-10T17:41:38Z] <cscott> updated OCG to version b0c57a1c6890e9fa1f2c3743fc14cb6a7f244fc3 (T120079)

Mentioned in SAL [2016-05-10T18:00:50Z] <cscott> OCG: clearing cache for ocg1003.eqiad.wmnet and ocg1003 (T120079)

Mentioned in SAL [2016-05-10T18:01:38Z] <cscott> OCG: script reported "Cleared 0 (of 363141 total) entries from cache in 56.894 seconds" (T120079)

cscott closed this task as Resolved. May 10 2016, 6:03 PM
cscott claimed this task.

Ok, looks like the script is fixed (see above). Of course, the entries all expired naturally from the cache over the weekend, but it made it through all 363141 redis keys without crashing with the "recursive nextTick" issue.

Joe added a comment. May 12 2016, 5:31 AM

> Ok, looks like the script is fixed (see above). Of course, the entries all expired naturally from the cache over the weekend, but it made it through all 363141 redis keys without crashing with the "recursive nextTick" issue.

I will reinsert ocg1003 into the pool and try a cleanup of ocg1002 now, just to confirm this is effectively solved.