
Debugging notebook cell action/state
Closed, Resolved · Public

Description

I'm running a test notebook on PAWS to explore sklearn operations, following a 20newsgroups example.

The notebook runs fine until I get to the grid search code that computes optimal parameters for the learning function. The GridSearchCV() call requests a job pool sized to the number of available cores. When I reach this cell in the notebook it begins executing but never completes (i.e. it stays marked with a * and never gets an execution number).

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)
gs_clf.best_score_
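(For context, the call above depends on objects defined in earlier cells of the 20newsgroups example. The following is a minimal sketch of those cells, assuming the standard scikit-learn tutorial pipeline; the exact step names, classifier, and parameter grid in the actual notebook may differ.)

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Training subset of the 20newsgroups corpus, restricted to a few categories.
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)

# Bag-of-words -> tf-idf -> linear classifier, as in the tutorial.
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42)),
])

# Small grid; GridSearchCV fits one pipeline per parameter combination per CV fold.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': (1e-2, 1e-3),
}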

I'm trying to find out how to debug this step in PAWS. I've looked at the Kubernetes worker health to see if I can detect activity via a load spike. This doesn't provide any useful data, afaik.

I have downloaded the notebook and run it on a local machine (4 cores), and the steps above execute successfully, though the grid search takes about 2 minutes to run. On PAWS the same code never completes, even after many minutes.

Are there any hints or suggestions for debugging this type of error in PAWS?

Event Timeline

The PAWS instances do have monitoring but the load doesn't seem to tick up in response to the grid search.

Not that this would be an effective end-user debugging option; I'm just looking for any kind of observable response to the computational operation.

The tools project contains a list of all the instances, including the k8s nodes, grid workers, and PAWS fabric, but this doesn't appear to include the nodes referenced above or their monitoring. Likely the separate PAWS project is the new platform being set up to isolate contention in T167086.

Both paws.tools.wmflabs.org and paws.wmflabs.org still show the same Jupyter notebooks, so there is some coordination between the backends. I need to explore further what crossover (if any) exists between the two projects.

It's possible that you just ran out of RAM and the kernel died? I think we have a 1G memory limit...

Thanks for the suggestion. I'll run a test with more limited RAM. BTW, is there any status/monitoring that lets a notebook user observe the Jupyter kernel state from their end? I haven't found anything yet.
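(One user-side option, sketched below under the assumption that psutil is available or installable in the PAWS image, is simply to print the kernel's own memory usage from a cell before and after the heavy step:)

import os
import psutil

# Resident memory of this kernel process and free memory as seen by the container/host.
proc = psutil.Process(os.getpid())
print("kernel RSS: %.0f MB" % (proc.memory_info().rss / 1e6))
print("available memory: %.0f MB" % (psutil.virtual_memory().available / 1e6))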

> It's possible that you just ran out of RAM and the kernel died? I think we have a 1G memory limit...

I now have a test environment with a 1G memory limit and can reliably make the Jupyter kernel fail there at the grid search call as a result of the limited memory (a VM with 1G of RAM, running Jupyter inside a Docker container, mainly for a simplified Jupyter setup). For reference, here's the Docker startup line (the notebook is loaded manually through the Jupyter UI):

sudo docker run -i -t --net=host -p 8888:8888 continuumio/anaconda3 /bin/bash -c \
  "/opt/conda/bin/conda install jupyter -y --quiet && mkdir /opt/notebooks && \
  /opt/conda/bin/jupyter notebook --notebook-dir=/opt/notebooks --ip='*' --port=8888 --no-browser --allow-root"

I have seen the memory error surface both as a pop-up message in the notebook saying the kernel died and as a traceback on the gs_clf.fit() call. Here's an example traceback for reference:

gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)

Exception in thread Thread-4:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 366, in _handle_workers
    pool._maintain_pool()
  File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 240, in _maintain_pool
    self._repopulate_pool()
  File "/opt/conda/lib/python3.6/multiprocessing/pool.py", line 233, in _repopulate_pool
    w.start()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 67, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
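(The fork failure above comes from the multiprocessing pool that GridSearchCV starts for n_jobs=-1; every forked worker needs extra memory on top of the kernel itself. A minimal sketch of the obvious mitigation on a memory-constrained host, not verified on PAWS, is to cap the worker count:)

# Sketch: run the search in the kernel process only (n_jobs=1), or with a small
# fixed worker count, so the grid search stays within a ~1G memory budget.
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1)
gs_clf = gs_clf.fit(twenty_train.data, twenty_train.target)
print(gs_clf.best_score_)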

Surprisingly, although I can now get my 1G test environment to fail, I can no longer get the PAWS environment to fail on the grid search steps as it did when I opened this issue. I have run the notebook several times and it now succeeds every time.

Were there any changes to the memory available over the past week?

A side note on my test environment: I had to resort to using a VM with a 1G memory limit and running Jupyter inside a Docker container on the VM in order to get a guaranteed cap on the RAM available to the notebook. I spent a lot of time trying to constrain the RAM using just Docker on my host with the --memory flag, but was not able to enforce any memory limit. My host is Ubuntu 16.04 with upstream Docker (17.03.1-ce), and this was true even after booting the host kernel with swapaccount=1. Even though the container reported a limit of 1G in memory.limit_in_bytes, code running in the container was able to allocate all the RAM on the host system. Odd.
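(For anyone reproducing this, a quick way to confirm what limit the container actually sees is to read the same cgroup file from inside the notebook; this sketch assumes the cgroup v1 layout used on Ubuntu 16.04:)

# Read the memory limit the container's cgroup reports (cgroup v1 path).
with open('/sys/fs/cgroup/memory/memory.limit_in_bytes') as f:
    limit_bytes = int(f.read().strip())
print('cgroup memory limit: %.2f GB' % (limit_bytes / 1e9))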

Chicocvenancio moved this task from Backlog to Waiting information on the PAWS board.