Page MenuHomePhabricator

Parallel npm install times out in Quibble
Open, Needs TriagePublic

Description

We had a mysterious Quibble build time last week when running npm install in parallel. We talked about it over irc in #wikimedia-releng.

The build we looked at is https://integration.wikimedia.org/ci/job/wmf-quibble-selenium-php72-docker/138874/ (marked as keep forever in jenkins).

Other builds:


Run `npm audit` for details.
<<< Finish: npm install in /workspace/src/extensions/MobileFrontend, in 77.281 s

DEBUG:quibble.util:Waiting for Parallel npm install for projects with 'selenium-test' in package.json: 80s elapsed, 60/61 completed
DEBUG:quibble.util:Waiting for Parallel npm install for projects with 'selenium-test' in package.json: 90s elapsed, 60/61 completed
DEBUG:quibble.util:Waiting for Parallel npm install for projects with 'selenium-test' in package.json: 100s elapsed, 60/61 completed
DEBUG:quibble.util:Waiting for Parallel npm install for projects with 'selenium-test' in package.json: 110s elapsed, 60/61 completed
DEBUG:quibble.util:Waiting for Parallel npm install for projects with 'selenium-test' in package.json: 120s elapsed, 60/61 completed
...

DEBUG:quibble.util:Waiting for Parallel npm install for projects with 'selenium-test' in package.json: 3450s elapsed, 60/61 completed
DEBUG:quibble.util:Waiting for Parallel npm install for projects with 'selenium-test' in package.json: 3460s elapsed, 60/61 completed
DEBUG:quibble.util:Waiting for Parallel npm install for projects with 'selenium-test' in package.json: 3470s elapsed, 60/61 completed
Build timed out (after 60 minutes). Marking the build as failed.
Build was aborted

It sounds like a race condition / thread safety issue within the Parallel and/or ProgressReporter class.:

  • maybe self.completed =+ 1
  • the thread is not properly canceled
  • the task being yield is empty/idling/doing nothing

Event Timeline

hashar updated the task description. (Show Details)

I have digged in the Quibble code and tried to reproduce the issue but haven't found anything conclusive.

I lost myself in cpython threading module and found https://github.com/python/cpython/commit/87255be6964979b5abdc4b9dcf81cdcfdad6e753 for https://bugs.python.org/issue40089 which got released in Python 3.7.8 (our images have 3.7.3 from Buster). But I have no idea whether that is related.

I haven't been able to find a way to reproduce the issue locally. An attempt I ran with Python 3.9.2

1# For T303270
2import logging
3from time import sleep
4
5from quibble.commands import Parallel
6from quibble.util import ProgressReporter
7
8logging.basicConfig(level=logging.DEBUG)
9
10
11class Printer():
12
13 def __init__(self, msg):
14 self.msg = msg
15
16 def execute(self):
17 sleep(3)
18 print(self.msg,)
19
20 def __str__(self):
21 return "print a message."
22
23
24build = Parallel(
25 name="foo",
26 steps=ProgressReporter(
27 desc="foo",
28 sleep_interval=0.2,
29 total=2,
30 iterable=[
31 Printer("line 1\nline 2\n<no trailing newline>"),
32 Printer("hello\nnewline at end\n"),
33 ],
34 )
35)
36build.execute()

I found an example of thread safeness which often not yields the expected results when run with Python 2.7 and pressing enter in the terminal. I might have tried it with Python 3.7 via docker but I can't remember whether it worked reliably or not.

1# Under Python 2.7 (and old python 3) running this would
2# not always yields 100. It is sometimes off
3#
4# For T303270
5
6import threading
7from time import sleep
8
9n = 0
10
11
12def foo():
13 global n
14 sleep(0.2)
15 n += 1
16
17
18threads = []
19for i in range(100):
20 t = threading.Thread(target=foo)
21 threads.append(t)
22
23for t in threads:
24 t.start()
25
26for t in threads:
27 t.join()
28
29print(n)

end of brain dump for today

Change 771820 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[integration/config@master] jjb: Remove parallel-npm-install flag

https://gerrit.wikimedia.org/r/771820

Change 771820 merged by jenkins-bot:

[integration/config@master] jjb: Remove parallel-npm-install flag

https://gerrit.wikimedia.org/r/771820

I lost myself in cpython threading module and found https://github.com/python/cpython/commit/87255be6964979b5abdc4b9dcf81cdcfdad6e753 for https://bugs.python.org/issue40089 which got released in Python 3.7.8 (our images have 3.7.3 from Buster). But I have no idea whether that is related.

As this task was mentioned in T374003: is it worth trying parallel npm install again now that integration has migrated off of Buster (T367534) and is hopefully on a Python version which includes that fix? (AFAICT the code for the --parallel-npm-install flag is still present in Quibble, though whether it still works is another question of course.)

Change #1098522 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[integration/config@master] Revert "jjb: Remove parallel-npm-install flag"

https://gerrit.wikimedia.org/r/1098522

is hopefully on a Python version which includes that fix?

The php7.4 Quibble images are still on Buster and thus Python 3.7.3. Looks like the biggest blocker is reproducing it with our code, but it sounded like a thread safeness issue?

Repro using P22106 and the Quibble 1.3.0 referred to in this task description:

podman run -it -v .:/src --entrypoint=bash docker-registry.wikimedia.org/releng/quibble-buster:1.3.0 -c 'while true; do python2 /src/P22106; done;'
100
100
100
100
100
100
100
96  # <--- wrong
100
100
100
98  # <--- wrong

Same for the 1.10.0:

hashar@contint1002:~$ docker run -it -v $(pwd):/src --entrypoint=bash docker-registry.wikimedia.org/releng/quibble-buster:1.11.0 -c 'while true; do python2 /src/P22106; done;'
100
100
100
99 # <--- wrong
100
100
100

Change #1098522 abandoned by Kosta Harlan:

[integration/config@master] Revert "jjb: Remove parallel-npm-install flag"

https://gerrit.wikimedia.org/r/1098522

From P22106 , threads are incrementing the same variable (n) and that is not thread safe.

There is a similar one with the logging handlers not being restored properly due to thread safety. I can't remember whether I have filed it as a task.

P22105 has code using Quibble ProgressReporter class.

Maybe the number of completed result is not properly incremented or too late.