
Fixing response time and enabling larger models in Wembedder
Open, Needs TriagePublic

Description

I have a machine learning webservice on Toolforge called Wembedder. It currently loads a large ~500 MB model, and I sometimes experience a considerable delay when the webservice is queried. As far as I understand, Kubernetes spawns 4 processes, and I believe the model might be running into a memory limitation.

At the same time I would like to retrain the model with new Wikidata data and enlarge it (probably considerably). Obviously, I will then run into further memory issues on Toolforge.
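To put rough numbers on it (a back-of-envelope sketch with assumed figures; the per-process overhead is a guess, not a measurement): if each of the 4 processes loads its own copy of the model, memory use scales linearly with the process count.

```python
# Back-of-envelope memory estimate (assumed numbers).
model_mb = 500      # approximate model size (from the task description)
overhead_mb = 100   # per-process interpreter/library overhead (a guess)

def total_memory_mb(workers):
    """Rough resident-memory estimate if each worker holds its own model copy."""
    return workers * (model_mb + overhead_mb)

print(total_memory_mb(4))  # with the default 4 processes
print(total_memory_mb(2))  # with 2 processes
```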

I am wondering whether it would be possible to get a (virtual?) machine with more memory. Should I request a Cloud VPS project here: https://phabricator.wikimedia.org/project/view/2875/ ? I see that "m1.large" is 8GB. Would it be possible to get more than that?

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Jun 25 2018, 10:06 AM

Please use the template from https://phabricator.wikimedia.org/project/view/2880/

Project Name: <projectname>
Type of quota increase requested: <cpu/ram/instance count/floating ip>
Amount of quota increase: <Something like: enough to create one large instance>
Reason: <Why is this quota increase required?>
bd808 added a subscriber: bd808.

I have a machine learning webservice on the Toolforge called Wembedder.
[...snip...]
I am wondering whether it would be possible to get a (virtual?) machine with larger memory.

The Toolforge project does not offer dedicated virtual machines for tools. (There have been some dedicated job grid queues and runners in the past, but this is not exactly the same thing.) If you are running a tool that has outgrown Toolforge the next step is to request a dedicated Cloud VPS project using the process described at https://phabricator.wikimedia.org/project/profile/2875/.

Moving from Toolforge to a Cloud VPS project also means accepting responsibility for full maintenance of the virtual machines as well as your application. This is generally easier to accomplish with co-maintainers, so you may want to look into finding others to help you with your project.

I see that "m1.large" is 8GB. Would it be possible to get more than that?

Custom image "flavors" are possible. We do not have a well documented process for requesting a custom flavor. You can describe your needs in the project creation ticket and we can work with you from there.

Your tool currently seems to receive something between 50 and 200 raw HTTP requests per day so it may be difficult to justify giving it a large pool of dedicated resources without a better description of how this work will be impactful for the Wikimedia community in the longer term.

Thanks very much for the explanation.

I am wondering whether it would be possible to change the script for the invocation of Kubernetes so that only one or two threads are started instead of four. That would help a bit with the memory issue, I suppose?

Most of the usage of Wembedder probably comes from Scholia: https://tools.wmflabs.org/scholia/

I am wondering whether it would be possible to change the script for the invocation of Kubernetes so that only one or two threads are started instead of four. That would help a bit with the memory issue, I suppose?

Your wembedder tool is currently running on the Kubernetes cluster. This actually took me a little digging to figure out because for some reason the state of the tool's $HOME/service.manifest file and the state of its jobs are out of sync. This should be fixable by issuing the commands:

$ webservice stop
$ webservice --backend=kubernetes python start

There is currently no method for a tool to customize the Kubernetes deployment manifest that the webservice command creates. It is, however, possible to customize the configuration of the uwsgi container by creating a $HOME/www/python/uwsgi.ini file. I think you can adjust the number of worker threads by creating a config file that looks something like this:

$HOME/www/python/uwsgi.ini
[uwsgi]
workers = 2

See https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web#Default_uwsgi_configuration for additional information.
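For completeness, the reason fewer workers helps: each uwsgi worker is a separate process, so anything a worker loads after forking is duplicated per worker, while data loaded read-only before the fork can be shared copy-on-write. A minimal stdlib sketch of that pattern (not Wembedder's actual code, and how much uwsgi itself shares depends on its lazy-apps setting):

```python
import os

# Data created before os.fork() is shared copy-on-write: read-only
# access in the child does not duplicate the pages.
model = bytes(10_000_000)  # stand-in for a large read-only model

pid = os.fork()
if pid == 0:
    # Child "worker": reads the shared data without copying it.
    assert len(model) == 10_000_000
    os._exit(0)
else:
    # Parent waits for the child and reports its exit status.
    _, status = os.waitpid(pid, 0)
    print("child exit status:", os.WEXITSTATUS(status))
```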

Thanks. It seems that workers = 2 in $HOME/www/python/uwsgi.ini has worked. The response time in Wembedder itself, as well as in Scholia, seems much better, and I no longer see "DAMN ! worker 4 (pid: 20) died, killed by signal 9 :( trying respawn ..." in the log file.

I have a $HOME/uwsgi.ini with the following content:

[uwsgi]
plugin = python,python3
http-socket = :8000
chdir = /data/project/wembedder/www/python/src/
logto = /data/project/wembedder/uwsgi.log
callable = app
manage-script-name = true
workers = 2
mount = /wembedder=/data/project/wembedder/www/python/src/app.py
die-on-term = true
strict = true
venv = /data/project/wembedder/www/python/venv

I suppose that is not read because it is in the wrong directory?

bd808 added a comment.Jun 27 2018, 3:31 PM

I have a $HOME/uwsgi.ini with the following content:
[...snip...]
I suppose that is not read because it is in the wrong directory?

Correct. I think I can confidently call this a bad design decision on our part for the webservice command. A $HOME/uwsgi.ini file will be processed if you are using the grid engine backend, but the Kubernetes backend looks for $HOME/www/python/uwsgi.ini instead.
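If it helps, the quickest workaround is simply to copy the existing file to the path the Kubernetes backend expects; whether every grid-engine setting in it still makes sense inside the Kubernetes container is worth checking. A sketch, using a temporary directory in place of the tool's real home:

```shell
# Temp dir standing in for the tool's $HOME (hypothetical paths).
HOME_DIR=$(mktemp -d)
mkdir -p "$HOME_DIR/www/python"
printf '[uwsgi]\nworkers = 2\n' > "$HOME_DIR/uwsgi.ini"

# The Kubernetes backend reads $HOME/www/python/uwsgi.ini, not $HOME/uwsgi.ini.
cp "$HOME_DIR/uwsgi.ini" "$HOME_DIR/www/python/uwsgi.ini"
grep workers "$HOME_DIR/www/python/uwsgi.ini"
```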

Vvjjkkii renamed this task from Fixing response time and enabling larger models in Wembedder to 9caaaaaaaa.Jul 1 2018, 1:01 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed a subscriber: Aklapper.
JJMC89 renamed this task from 9caaaaaaaa to Fixing response time and enabling larger models in Wembedder.Jul 1 2018, 3:39 AM
JJMC89 raised the priority of this task from High to Needs Triage.
JJMC89 updated the task description. (Show Details)
JJMC89 added a subscriber: Aklapper.