User Details
- User Since
- Apr 28 2015, 10:03 PM (469 w, 1 d)
- Availability
- Available
- LDAP User
- Dfko
- MediaWiki User
- Unknown
Jan 4 2018
Hi, I am looking around for the offending files to delete them, but it has been a long while since I worked on any of this and I don't recall how to find them.
Can you provide the exact path of the offending files?
Aug 28 2016
I need libxml2 (and probably also libxml2-dev) in Kubernetes for recitation-bot, since it uses the command-line xsltproc to run XSLT processing. The -dev package will allow me to convert it to use Python's lxml wrapper to do the same (python3 container).
May 8 2015
Thanks
Interesting...
Getting back on it.
Can you post that dump for me somewhere?
May 1 2015
We have workers on the failed queue already. We were seeing jobs go from failed -> failed forever due to this, but that should be resolved.
Do you think the default of 15 minutes will be too long? >20 events per second is rare, so a very conservative estimate of queue load would be 18,000 events in flight.
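For reference, the 18,000 figure is just rate times TTL. A minimal sketch of that arithmetic, assuming a sustained 20 events per second and that nothing is consumed before the 15-minute TTL expires (the function name is only illustrative):

```python
def events_in_flight(events_per_second: float, ttl_seconds: float) -> int:
    """Upper bound on queued events if none are consumed before the TTL expires."""
    return int(events_per_second * ttl_seconds)


# 20 events/s sustained over a 15-minute (900 s) TTL
estimate = events_in_flight(20, 15 * 60)
print(estimate)
```

A shorter TTL scales the bound down linearly, so halving the TTL halves the worst-case number of events held in the queue.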
I've run it for about 20 minutes; the queue seems to have plateaued at about 2 GB and is stable. I will keep an eye on it for a bit longer.
It happens whenever I pass anything via the command line parameter of worker, whether it can be parsed as an int or not, so I am guessing they just forgot to ever call int() on that parameter. Haven't yet gone to track down where that should be happening though.
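To illustrate the kind of omission I am guessing at (this is not RQ's actual code, and the `--ttl` flag name here is hypothetical): command-line values arrive as strings, so unless the parser or the caller converts them, later numeric code receives a str. A small sketch using the standard argparse module:

```python
import argparse


def parse_worker_args(argv):
    """Parse a hypothetical --ttl option, converting it to int at the boundary."""
    parser = argparse.ArgumentParser(description="hypothetical worker launcher")
    # type=int converts the raw string; without it, args.ttl would stay a str
    parser.add_argument("--ttl", type=int, default=None,
                        help="time to live in seconds")
    return parser.parse_args(argv)


args = parse_worker_args(["--ttl", "900"])
print(type(args.ttl).__name__, args.ttl)
```

Leaving the option off entirely yields `None` in this sketch, which would match the observation that only explicitly passed values trigger the problem.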
I think we will have to live with dropping events under load rather than filling queues, but 15 minutes seems plenty to keep up with the bursts we've seen so far, as the final stage in the pipeline does not take long.
I was not able to find an explicit TTL format that would not trigger this bug, but not specifying it at all seems not to trigger it. I am going to try to start things up again and see if it works that way. The default TTL is 500 seconds.
We considered that early on, and this is making me consider it again, though there might not be enough time left in the project for a big move like that.
Still figuring out the significance of this, but there is a failure in a line dealing with TTLs in the RQ library's registry.py that results in churn from the failure queue back onto the failure queue, which may be behind this:
Apr 30 2015
We started it up again to get a key dump (see previous 3 comments)
@yuvipanda
Apr 29 2015
@valhallasw We're up to about a gigabyte, can I get a dump now?
Great, I was going to ask you about that anyway.
Yeah, I've been trying to figure this out. My best guess from the documentation so far is that failed jobs may not get timeouts set the same way originally submitted jobs do. I need to work with this hypothesis more and maybe dig into the source of rq.
Apr 28 2015
Can you specify exactly what action you took to remove these jobs from the queue? Our service has been down while I've been hunting for the bug, to no avail so far, so I'd like to start it up again with an additional task that periodically clears the queue by hand so we can stay up until a better solution is found.
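Roughly what I have in mind for that periodic clearing task, as a sketch only: the queue here is a plain list standing in for the real failed queue, and `clear_failed_jobs` and `clear_periodically` are hypothetical names. The real version would call whatever removal action turns out to be correct for the queue library.

```python
import time


def clear_periodically(clear_queue, interval_seconds, rounds, sleep=time.sleep):
    """Call clear_queue() once per interval, `rounds` times; return total removed."""
    removed = 0
    for _ in range(rounds):
        removed += clear_queue()
        sleep(interval_seconds)
    return removed


# Stand-in for the real failed queue: a plain list of job ids
failed_jobs = ["job-1", "job-2", "job-3"]


def clear_failed_jobs():
    """Empty the stand-in queue and report how many jobs were dropped."""
    count = len(failed_jobs)
    failed_jobs.clear()
    return count


total = clear_periodically(clear_failed_jobs, interval_seconds=0, rounds=2)
print(total)
```

Passing the clearing function in as a parameter keeps the loop independent of the queue library, so the stopgap can be swapped out once the underlying bug is understood.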