hang or infinite loop
Closed, Resolved · Public

Description

Running that script with the -always option over hundreds of pages, putting page changes asynchronously, leads to an infinite loop, or the script hangs, consuming all resources of one CPU (25% on a 4-core machine), and cannot be stopped via Ctrl-C.

Event Timeline

Xqt raised the priority of this task to Needs Triage.
Xqt updated the task description. (Show Details)
Xqt changed Security from none to None.
Xqt subscribed.

c:\Pywikipedia\ssh\pywikibot\core>pwb.py version
Pywikibot: [ssh] pywikibot-core (rPWBC117134afd32d, g4806, 2014/12/20, 15:04:24, OUTDATED)

Release version: 2.0b3
httplib2 version: 0.9

cacerts: C:\Pywikipedia\ssh\pywikibot\core\externals\httplib2\python2\httplib2\cacerts.txt

certificate test: ok

Python: 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)]

unicode test: ok

PYWIKIBOT2_DIR: Not set
PYWIKIBOT2_DIR_PWB: C:\Pywikipedia\ssh\pywikibot\core
PYWIKIBOT2_NO_USER_CONFIG: Not set
Config base dir: c:\Pywikipedia\ssh\pywikibot\core

This was not due to the solve_disambiguation script. Investigating the matter, I found that ^C was not recognized if the async buffer exceeded max_queue_size. This happened because no page was put to the live wiki. I found another hint pointing to the API cache while restarting the script:
Could not load cache: TypeError("'int' object is not iterable")

Very strange, but clearing the api_cache folder solves the problem. Maybe pywikibot.async_manager returns None?

Another idea: maybe it had something to do with the cache implementation of https://gerrit.wikimedia.org/r/#/c/131017/, which was fetched into my repository when I started this script.

jayvdb renamed this task from solve_disambiguations.py hangs or has an infinit loop to hang or infinite loop. Jun 4 2015, 9:46 AM
Xqt triaged this task as High priority. Feb 27 2016, 10:10 AM

I found my xqbot/afd-notice.py script hanging several times when putting a page. Ctrl-C doesn't work, and there is no CPU consumption in that case. The bug occurs inside SingleSiteBot.userPut() after the summary is printed; there is no async saving in this script.

Reading old article list...
307 articles found
Reading ignoring lists...
155 users found to opt-out
Check for moved pages...
Processing data...

>>> Bauwelt Delmes Heitmann <<< is tagged for deleting
>>> Creator is Delmesheitmann


>>> Benutzer Diskussion:Delmesheitmann <<<
@@ -0,0 +1 @@
+ {{ers:user:xqbot/LD-Hinweis|Bauwelt Delmes Heitmann|angelegte}}

Edit summary: Bot: Benachrichtigung über Löschdiskussion zum Artikel [[Bauwelt Delmes Heitmann]]

Looks like 'async' causes the problem:

if 'async' not in kwargs and self.getOption('always'):
    kwargs['async'] = True

Investigating deeper into this issue, I found a problem inside the BaseBot class:

BaseBot.userPut invokes BaseBot._save_page, passing the page, page.save as the function, and finally **kwargs.

Now BaseBot._save_page tries to call the given function:

func(*args, **kwargs)

which should be equivalent to calling

page.save(*args, **kwargs)

page.save made an async_request with page._save as the function, but page._save itself was never reached, and there was no exception, message, or any other hint of the problem; the task just hung and couldn't be stopped with Ctrl-C. Killing the process was the only way to stop it.
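
To make that chain concrete, here is a condensed, runnable sketch of the path described above (simplified from pywikibot's BaseBot and Page; the stubs, the print statements and the spelling 'asynchronous' are my own, since 'async' later became a Python keyword):

def async_request(func, *args, **kwargs):
    # stand-in for pywikibot.async_request: enqueue and return at once
    print('queued %s for the background thread' % func.__name__)

class Page(object):
    def _save(self, **kwargs):
        print('actually writing the page')     # never reached in the bug

    def save(self, asynchronous=False, **kwargs):
        if asynchronous:
            async_request(self._save, **kwargs)
        else:
            self._save(**kwargs)

class BaseBot(object):
    always = True                               # as with the -always option

    def userPut(self, page, **kwargs):
        if 'asynchronous' not in kwargs and self.always:
            kwargs['asynchronous'] = True       # the line quoted above
        self._save_page(page, page.save, **kwargs)

    def _save_page(self, page, func, *args, **kwargs):
        func(*args, **kwargs)                   # equals page.save(**kwargs)

BaseBot().userPut(Page())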

Maybe we should use page_put_queue.get_nowait() or a timeout parameter for page_put_queue.get() in async_manager?
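
A minimal sketch of that suggestion (an assumption about how async_manager could be changed, not the actual patch): polling the queue with a timeout lets the manager thread re-check a stop flag instead of blocking forever in get():

import threading
try:
    from Queue import Queue, Empty    # Python 2, as in the report
except ImportError:
    from queue import Queue, Empty    # Python 3

page_put_queue = Queue(64)            # config.max_queue_size
stop_flag = threading.Event()

def async_manager():
    while not stop_flag.is_set():
        try:
            request, args, kwargs = page_put_queue.get(timeout=1)
        except Empty:
            continue                  # nothing queued; re-check the flag
        if request is None:           # sentinel for an orderly shutdown
            break
        request(*args, **kwargs)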

Indeed there is a problem with the async put.
async_request() calls page_put_queue.put((request, args, kwargs)) in its last line, but that call never returns.
page_put_queue is a Queue object with a maximal queue length of config.max_queue_size, which is 64 by default (and in this case too).
I also figured out that async_manager was never reached, even in the cases where the script works correctly, which looks curious.
So we either have a problem inside the Python library (py 2.7.10 here) or an unhandled timeout problem. Any other suggestions?
I propose deactivating the async stuff until the problem is found.
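
The freeze is easy to reproduce outside pywikibot. A minimal sketch (illustrative, not pywikibot code): with no consumer thread running, the put() that exceeds maxsize blocks forever, and on CPython 2 a blocking lock wait in the main thread is not interruptible by Ctrl-C, which matches the symptom:

try:
    from Queue import Queue    # Python 2.7, as in the report
except ImportError:
    from queue import Queue    # Python 3

max_queue_size = 64            # pywikibot's default
page_put_queue = Queue(max_queue_size)

for i in range(max_queue_size):
    page_put_queue.put(('request %d' % i, (), {}))   # fills the queue

print('queue full; the next put() hangs')
page_put_queue.put(('one too many', (), {}))         # blocks forever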

Ah, that might be the problem. async_manager was never reached because the buffer was full from previous puts to the queue, and it seems the queue was never drained. Maybe there is a timeout error when putting the page, but the request wasn't removed from the queue. My problem is that the bug occurs only after a long delay, which might be several days. I will decrease max_queue_size now to provoke the problem sooner.
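
For reference, a hypothetical user-config.py excerpt doing just that; with a tiny buffer the blocking put() is reached after a few saves instead of after 64:

# user-config.py (hypothetical excerpt)
max_queue_size = 4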

I guess I've found the problem.

Change 276428 had a related patch set uploaded (by Xqt):
[bugfix] reset global stopped variable and unfreeze async put.

https://gerrit.wikimedia.org/r/276428
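
My reading of the commit message, sketched as code (an assumption; the authoritative diff is in the Gerrit change above): a module-global stopped flag set during shutdown was never reset, so after one stopme() the consumer thread was gone for good and every later async put piled up in the bounded queue until put() froze:

import threading
try:
    from Queue import Queue
except ImportError:
    from queue import Queue

page_put_queue = Queue(64)
stopped = False                        # global flag, per the commit message

def async_manager():
    while True:
        request, args, kwargs = page_put_queue.get()
        if request is None:            # sentinel enqueued by stopme()
            break
        request(*args, **kwargs)

def stopme():
    global stopped
    stopped = True                     # bug: nothing ever reset this
    page_put_queue.put((None, (), {}))

def async_request(request, *args, **kwargs):
    global stopped
    if stopped:                        # the fix: unfreeze async put
        stopped = False
        threading.Thread(target=async_manager).start()
    page_put_queue.put((request, args, kwargs))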

Hm, there is something wrong with stopme():
I got
Waiting for 1 pages to be put. Estimated time remaining: 0:00:05
but the queue should be empty because there was no page to put!
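
One possible explanation (a guess, not verified against stopme()'s actual code): the shutdown sentinel itself sits in the queue when the remaining count is taken, so qsize() reports one "page" even though no real page is waiting:

try:
    from Queue import Queue
except ImportError:
    from queue import Queue

page_put_queue = Queue(64)
page_put_queue.put((None, [], {}))     # shutdown sentinel, not a page
print('Waiting for %d pages to be put.' % page_put_queue.qsize())  # -> 1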

Could you provide a new summary of the bug, explaining why it was hanging? Which part of this changeset does the 'unfreeze' part? It seems like the changeset has a few other closely related fixes in it, but they are more cosmetic, making it hard to see the exact 'freeze' bug and its fix.

Change 276428 merged by jenkins-bot:
[bugfix] Re-enable stopme() called inside a script

https://gerrit.wikimedia.org/r/276428