
Test Performance of Marian NMT translation in stat cluster
Open, High, Public

Description

In order to assess the performance of Marian NMT in production and decide what hardware to order, we are going to test how the model performs on one of the stats machines.

Event Timeline

cc @JAllemandou so he can help @santhosh if needed. We recommend you use stat1006.eqiad.wmnet

Some docs about stats machines and general data access:

https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Shell_access
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients

ssastry renamed this task from Test Performance of marian NT translation in stat cluster to Test Performance of Marian NMT translation in stat cluster. Mar 9 2020, 6:14 PM
ssastry updated the task description.

@JAllemandou I have filed https://phabricator.wikimedia.org/T247246 for access. Setting up MarianMT on that system will require installation of a set of software libraries with root access. This includes Intel MKL, cmake (maybe these are already installed if the system is used for NumPy) and similar development libraries (https://marian-nmt.github.io/docs/). Let me know if there are any restrictions I should be aware of.
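
For reference, a rough sketch of the CPU-only build I have in mind, following the upstream docs (a sketch only; exact CMake flag names may vary by Marian version, and MKL or OpenBLAS is picked up automatically if found):

# Sketch of a CPU-only Marian build, per https://marian-nmt.github.io/docs/
git clone https://github.com/marian-nmt/marian
mkdir -p marian/build && cd marian/build
cmake .. -DCOMPILE_CUDA=off
make -j "$(nproc)"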

Added access! For an overview of resources: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Clients

About compiling code: I'd proceed in steps, trying to work with what we have now (already deployed on the statistics nodes) and then reporting errors/missing things/etc. here. I'll try to support as much as possible :)

I'd suggest using one of stat100[4-6] for the test; they all have the same config. stat1005 has a GPU and runs Debian Buster, which has more up-to-date dependencies.

Thanks. I was able to log in. The first thing missing on stat1006 is cmake. I also noticed I cannot connect to GitHub (ping and wget fail; git works). Administratively prohibited?
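
A possible workaround, assuming outbound HTTP from the stat hosts goes through the Wikitech-documented proxy webproxy.eqiad.wmnet:8080 (an assumption here), is to export the proxy variables for tools like wget:

# Point command-line tools at the cluster HTTP proxy (hostname is an assumption)
export http_proxy=http://webproxy.eqiad.wmnet:8080
export https_proxy=http://webproxy.eqiad.wmnet:8080
# Hypothetical connectivity check
wget -O /dev/null https://github.com/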

Ok, I installed cmake as a non-root user and set up Marian, but the performance is even below my own laptop. The main difference is that I compiled Marian with OpenBLAS, while MarianNMT recommends Intel MKL and says it is much faster than OpenBLAS. Installing Intel MKL requires root access. I should also note that Intel MKL has its own license, which I don't think is compatible with FOSS licenses.
Steps:

# Import Intel's package signing key (all steps require root)
wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
# Add the Intel MKL apt repository
sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
# Install MKL and remove the downloaded key file
apt-get update
apt-get install -y --no-install-recommends intel-mkl-64bit-2019.5-075
rm -f GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB
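
If the packages get installed, a quick sanity check that the libraries landed could look like this (a sketch; /opt/intel/mkl as the install prefix is an assumption based on Intel's usual packaging):

# Check for the single dynamic MKL runtime library (path is an assumption)
ls /opt/intel/mkl/lib/intel64/libmkl_rt.so
# Check whether the dynamic linker can resolve any MKL libraries
ldconfig -p | grep -i libmkl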

@elukey is it possible to get this installed?

@santhosh importing the Debian packages shouldn't be a big problem, but the license surely is. Looping in @MoritzMuehlenhoff to get his opinion before proceeding.

Context: we are trying to build MarianMT on a stat node to test its performance on bare-metal hardware. The final goal is to get some insight into what hardware specs we'd need if we decided to put MarianMT into production in the future.

From a quick check, MKL doesn't appear to be FLOSS, but just a bundle of pre-compiled, shared libraries, essentially "freeware"? If this is just for performance evaluation, why does it need to happen in prod anyway? Can't the tests/benchmarks simply be run on a local laptop? Being shared hosts with other users, the stat* hosts are not a good place to run benchmarks anyway.

One of the motivations was also, I think, to figure out if problems like these could arise (before ordering any hardware). The stat boxes also offer GPUs that could be handy to have ready to use (even if MarianMT sadly only supports Nvidia CUDA...).

@santhosh at this point I'd try to test if OpenBLAS could be a viable alternative. Do you think that some perf tuning could be done, maybe with the support of upstream?

We are in touch with upstream, but I don't think such fine-tuning is within the scope of the language team or upstream in the near future. We are running a system in labs with Intel MKL, and it is faster than stat1006. The intention of this test on stat1006 is to see if we can avoid an Nvidia CUDA GPU, to see if we can get our CPU-based systems to a reasonable speed in production, and also to avoid benchmarking speed on labs systems.

@santhosh my point is that the SRE team follows some standards about FLOSS, and deploying Intel MKL in production does not seem a viable option to me, due to what Moritz said. I am not authoritative to make any call of course; I am only suggesting what I think is the best path, that's it :)

@elukey no problem.

@Nuria what do you advise? Is there any other system, not shared, for doing this testing? Or should we check with SRE whether the Intel MKL dependency is a blocker even if we have such an isolated system? Without Intel MKL, it does not make sense to do any performance testing: Marian NMT not only strongly suggests Intel MKL, but also discourages using OpenBLAS with Marian.

Instead of relying on random comments, let's get some actual numbers on the performance benefit of MKL over OpenBLAS.

@santhosh to avoid installing packages on stat boxes for a one-off test: if you already have a working setup on your laptop, could you provide numbers to justify Intel MKL? We could even think of manually adding the repos to stat1006's apt config, installing the packages, and reporting back.

@MoritzMuehlenhoff
@santhosh has done the testing with both MKL and OpenBLAS, and the former gives much better performance. Preliminary testing has happened on labs, and now we are moving to the phase of testing on more "prod-like" hardware. If we can run some tests with MKL, maybe we can pursue the path of getting a license for MKL that we are comfortable with?

OK, so to recap, I read two concerns:

First, the use of stat* boxes for a short-term performance evaluation project, a purpose that doesn't fit well with this equipment's role or with production requirements. This idea actually came out of a meeting (which I was also in :)), and was proposed as a workaround in order to test in a non-shared, non-virtualized environment, unlike WMCS. This is indeed not a great fit and carries some risks, but I'm comfortable with bending the rules in the interest of time, given a) the temporary nature of the request (~a month or so?) and b) the relative urgency (due to annual planning). Do folks feel strongly about that being an issue? We also have some spare boxes that we could use, but I fear it's going to take time (time we don't have) to deploy those.

Second, licensing concerns; this one is a bit more complicated (please bear with me!). As I understand it, Marian NMT requires, for the CPU version, a library that can be either a) Intel MKL or b) OpenBLAS, with upstream strongly preferring the former as the faster option. Unfortunately, it seems like Intel's MKL is not an open source product; from what I gather, it used to have a very restrictive per-server license, but as of February 2020 is licensed under the "Intel Simplified Software License". This is a relatively simple license, but still a proprietary one and definitely not a FL/OSS license (it forbids modification, reverse engineering, etc.). That... is a problem. First of all, as Foundation staff we cannot accept (on behalf of the Foundation) arbitrary terms of use (incl. licenses, EULAs on registration forms etc.) without going through a process that includes (at minimum) a legal review; this is a blocker to a deployment on both servers and staff laptops, so be careful with that :) Second, deploying this on WMCS would be a violation of the WMCS ToS (there's an even higher bar there).

We don't use non-open source software to serve production traffic as a matter of policy, so testing with a component under a proprietary license (even if we quickly/easily could do that, which is not the case...) would not provide a lot of value to this evaluation for annual planning purposes.

Performance-wise: from a cursory look on the web, it looks like targeting a CPU >= Skylake could provide a speed bump and perhaps even performance comparable to MKL under certain microbenchmarks(?). stat1007 has a Skylake-era processor, so perhaps that could provide a performance boost(?). It's also worth noting that stat1007 is a mid-level CPU box (Intel Silver); for a production deployment we could use much more expensive and performant hardware. Unfortunately, due to price, we don't carry these as spares :(

Longer-term and given you are already in touch with upstream about this: it seems like Intel has actually released a product called Intel MKL-DNN, specifically for deep neural networks and also supporting GPUs/OpenCL, but licensed under an open source license (Apache2). Intel says that "although the Intel MKL-DNN library includes functionality similar to Intel® Math Kernel Library (Intel® MKL) 2017, it is not API compatible". I have really no idea how Marian NMT works or if this is a good idea in the first place, but I wonder if that's something we could ask upstream about :)

Hope this all makes sense, and let me know if I misunderstood or missed some of the concerns here.

Just making sure that @Pginer-WMF and @SubrahamanyamVarma are aware of the licensing issues. @santhosh, would you care to try your test on stat1007 to see CPU perf there?

I think it is worth considering hiring for a contracting project to build a fully open-source stack for Marian NMT. The place to start seems to be making Marian use MKL-DNN rather than Intel MKL. This is something we should also keep in mind for the annual plan.

In addition to what Faidon said: Test results from Cloud VPS are not useful here; OpenBlas contains assembly-optimised functions for specific CPU models and through virtualisation these are no longer fully effective (the virt hosts don't expose all the CPU features by default compared to baremetal). Also, on Cloud VPS the test VMs are not the only tenant, so it highly depends on the load of the virt host.

How about we run these tests on stat1008? It was only racked yesterday, so we can make sure there are no other users yet, and it has the latest CPU generation (Cascade Lake), which would also be comparable to what we'd buy as a new server. And given that it's a fresh machine, we can simply reimage it after Santhosh's tests are over.
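
For reference, a quick way to see which CPU features the kernel exposes to OpenBLAS on a given host (a sketch):

# List the AVX-family flags; on a VM with a masked CPU model,
# avx512* entries may be missing compared to bare metal
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep '^avx' | sort -u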

Oh, that sounds perfect, let's do that :) We should also try a build with the right make flags etc. (something like TARGET=SKYLAKEX, like the FAQ says). Thanks all!

@santhosh Would you be so kind as to move the tests to stat1008? It is our newest machine with the best resources. @elukey can help with installs; having a fresh machine allows us to give you more leeway on the software installed, as we can later wipe it clean. Let's please proceed with the test, which I think will be tremendously helpful.

@santhosh stat1008 is ready; you can ssh to it and copy your stat1006 home with rsync -av stat1006.eqiad.wmnet::srv/home/santhosh/* .

@santhosh When you've setup your test environment and want to test OpenBLAS optimised for the CPU architecture of stat1008, let me know, I can help create a custom build and deploy it.

I have copied the environment to stat1008. Now we need to install Intel MKL; once that is done, I will build Marian NMT using Intel MKL.
If we need to test with OpenBLAS on the same machine, I will also compile MarianNMT with OpenBLAS and compare the results of the two builds.

> something like TARGET=SKYLAKEX like the FAQ says

@santhosh if I got it correctly, the suggestion from Moritz and Faidon was to test OpenBLAS on stat1008, compiling it with Intel Skylake optimizations.

The alternative that was proposed is to follow up with upstream to check if Intel MKL-DNN could be an alternative. As far as I understood from Faidon's comment above, we are not installing Intel MKL (the non-open-source version, to be clear) on any production node.

Ok, I think I misunderstood it, sorry. I thought we could install Intel MKL and reimage after testing.

> having it be a fresh machine allows us to give you more leeway on the software installed, as we can later wipe it clean. Let's please proceed with the test, which I think will be tremendously helpful.

@Nuria did you mean to install OpenBLAS on stat1008 by the above statement? Sorry if I misunderstood. Are we expecting that OpenBLAS will perform better on stat1008 than on stat1006?

@santhosh the suggestion came from Faidon and Moritz, in two parts:

  1. Faidon's point was that it doesn't seem worth investing time in testing a library that we'll not be able to deploy in production due to the license issue.

  2. OpenBLAS seems to be able to leverage hardware features of recent CPUs, so stat1008 was proposed as a way to test that quickly.

> Are we expecting that OpenBLAS will perform better on stat1008 than on stat1006?

Hopefully yes, building it with the proper SKYLAKEX flags, as described in https://github.com/xianyi/OpenBLAS/wiki/faq

let us know if you need more help with this @santhosh
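
For reference, a from-source build along the lines of the FAQ would look roughly like this (a sketch; the install PREFIX is arbitrary):

# Build OpenBLAS with the AVX-512 (SKYLAKEX) kernels, per the FAQ
git clone https://github.com/xianyi/OpenBLAS
cd OpenBLAS
make TARGET=SKYLAKEX
make TARGET=SKYLAKEX PREFIX="$HOME/openblas" install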

Due to the recent changes in priorities, I may not get back to this testing immediately. But there is a dependency: having OpenBLAS with a custom configuration installed on the system. This will require help from @elukey, since I don't have enough permissions on that system.

Anytime! When you have time for this, let's sync and find some overlapping hours on IRC to work together :)

I had a closer look into a rebuild of OpenBLAS for Skylake/AVX-512, but it turns out this isn't actually needed, thanks to our distro of choice :-)

The instructions to rebuild with TARGET=SKYLAKEX are for anyone using the source distribution, but the Debian maintainers build the package with dynamic CPU detection so that all archs are covered:

> On amd64, arm64 and i386, libopenblas-base provides a multiple architecture library. All kernels are included in the library and the one matching your architecture is selected at run time. Recompiling locally should bring minimal performance improvement.
> On the contrary, on other archs, the package is compiled with minimal optimizations, so that it can run on all hardware. You may want to recompile it locally for optimal performance.

@santhosh As such, given that stat1008 has a newer-than-Skylake CPU and the Buster version of OpenBLAS is deployed, you should be able to run your tests whenever your priorities allow for it. Let me know if you need any other assistance, and feel free to ping me directly on IRC!
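
To double-check which kernel the packaged OpenBLAS selects at runtime, OpenBLAS can print the detected core at startup (a sketch, assuming the system NumPy links against the packaged OpenBLAS, which may not hold here):

# OPENBLAS_VERBOSE=2 makes OpenBLAS print the selected core, e.g. "Core: SkylakeX"
OPENBLAS_VERBOSE=2 python3 -c 'import numpy as np; np.dot(np.ones((64, 64)), np.ones((64, 64)))'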

@MoritzMuehlenhoff Thanks. Just to confirm: I can just build Marian (which will use the CPU-optimized OpenBLAS) and run the tests, right? I need ApacheBench (ab) for the tests, and stat1008 does not have it. Can you install it (the apache2-utils package)?

Exactly, the OpenBLAS available in Debian Buster (which stat1008 uses) provides CPU-optimized computation kernels. I've just installed apache2-utils on the host; feel free to also ping me on IRC (moritzm) if you need other packages installed (we'll be reimaging the server anyway after the tests are done).

Load test: OpenBLAS (stat1008) performance testing vs GPU-based OpusMT

10 requests, concurrency 1

OpenBLAS (stat1008):
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done


Server Software:        TornadoServer/6.0.4
Server Hostname:        localhost
Server Port:            8888

Document Path:          /api/translate
Document Length:        1065 bytes

Concurrency Level:      1
Time taken for tests:   74.010 seconds
Complete requests:      10
Failed requests:        0
Total transferred:      12190 bytes
Total body sent:        11220
HTML transferred:       10650 bytes
Requests per second:    0.14 [#/sec] (mean)
Time per request:       7401.006 [ms] (mean)
Time per request:       7401.006 [ms] (mean, across all concurrent requests)
Transfer rate:          0.16 [Kbytes/sec] received
                        0.15 kb/s sent
                        0.31 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:  7188 7401 160.9   7431    7702
Waiting:     7188 7401 160.8   7431    7702
Total:       7188 7401 160.9   7431    7702

Percentage of the requests served within a certain time (ms)
  50%   7431
  66%   7431
  75%   7484
  80%   7577
  90%   7702
  95%   7702
  98%   7702
  99%   7702
 100%   7702 (longest request)
GPU-based OpusMT:

This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 104.197.224.16 (be patient).....done


Server Software:        nginx/1.14.0
Server Hostname:        104.197.224.16
Server Port:            80

Document Path:          /api/translate
Document Length:        1097 bytes

Concurrency Level:      1
Time taken for tests:   12.799 seconds
Complete requests:      10
Failed requests:        0
Total transferred:      12720 bytes
Total body sent:        11210
HTML transferred:       10970 bytes
Requests per second:    0.78 [#/sec] (mean)
Time per request:       1279.883 [ms] (mean)
Time per request:       1279.883 [ms] (mean, across all concurrent requests)
Transfer rate:          0.97 [Kbytes/sec] received
                        0.86 kb/s sent
                        1.83 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      238  414 259.8    282    1010
Processing:   673  866 164.5    887    1207
Waiting:      673  865 164.5    887    1207
Total:        938 1280 335.6   1275    2066

Percentage of the requests served within a certain time (ms)
  50%   1275
  66%   1381
  75%   1441
  80%   1455
  90%   2066
  95%   2066
  98%   2066
  99%   2066
 100%   2066 (longest request)
100 requests, concurrency 10

OpenBLAS (stat1008):
santhosh@stat1008:~$ ab -n 100 -c 10 -p English.txt -T "application/json" http://localhost:8888/api/translate
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient).....done


Server Software:        TornadoServer/6.0.4
Server Hostname:        localhost
Server Port:            8888

Document Path:          /api/translate
Document Length:        1065 bytes

Concurrency Level:      10
Time taken for tests:   727.528 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      121900 bytes
Total body sent:        112200
HTML transferred:       106500 bytes
Requests per second:    0.14 [#/sec] (mean)
Time per request:       72752.764 [ms] (mean)
Time per request:       7275.276 [ms] (mean, across all concurrent requests)
Transfer rate:          0.16 [Kbytes/sec] received
                        0.15 kb/s sent
                        0.31 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0     0    0.1      0      0
Processing:  7309 68807 13392.6  72702  73311
Waiting:     7309 68807 13392.6  72702  73311
Total:       7310 68807 13392.6  72702  73311

Percentage of the requests served within a certain time (ms)
  50%  72702
  66%  72803
  75%  72871
  80%  72903
  90%  73016
  95%  73127
  98%  73268
  99%  73311
 100%  73311 (longest request)
GPU-based OpusMT:

This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking 104.197.224.16 (be patient).....done


Server Software:        nginx/1.14.0
Server Hostname:        104.197.224.16
Server Port:            80

Document Path:          /api/translate
Document Length:        1097 bytes

Concurrency Level:      10
Time taken for tests:   47.288 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      127200 bytes
Total body sent:        112100
HTML transferred:       109700 bytes
Requests per second:    2.11 [#/sec] (mean)
Time per request:       4728.760 [ms] (mean)
Time per request:       472.876 [ms] (mean, across all concurrent requests)
Transfer rate:          2.63 [Kbytes/sec] received
                        2.32 kb/s sent
                        4.94 kb/s total

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      238  339 170.9    251     978
Processing:  1082 4062 1108.1  4005    7183
Waiting:     1081 4061 1108.2  4001    7183
Total:       1362 4401 1099.0  4283    7484

Percentage of the requests served within a certain time (ms)
  50%   4283
  66%   4325
  75%   4424
  80%   4574
  90%   6709
  95%   7041
  98%   7437
  99%   7484
 100%   7484 (longest request)

Note on GPU: NVIDIA Tesla P4 GPU with 16 vCPUs, 104 GB memory, Ubuntu 18.04 and CUDA, 512 GB SSD. Requests go over the internet.
Note on stat1008: No requests over the internet; requests are from the same host.
All tests use the same input paragraph and the same versions of Marian and OpusMT.

Observations
OpenBLAS takes about 7 seconds (mean) for an average paragraph translation, while the GPU completes it in less than a second (mean). The GPU gives a faster per-request response under concurrency 10 than under concurrency 1. Since there is no Intel MKL based system to compare against, I tried it on my own laptop (4-core Intel Core i7 7th gen, 16 GB RAM), which took 6 seconds (mean).

So OpenBLAS on stat1008 with 32 cores is slower than my personal laptop with Intel MKL, and the GPU translates in under a second. OpenBLAS optimized for the CPU architecture definitely improved the performance: stat1006 with OpenBLAS was taking 18 seconds for the same input, while stat1008 takes 7 seconds.
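
Summarizing the mean per-request times reported above (rounded):

Setup                                        Mean time per request
stat1006, OpenBLAS (generic build)           ~18 s
stat1008, OpenBLAS (CPU-optimized)           ~7.4 s
Laptop (i7-7600U, 4 cores), Intel MKL        ~6 s
GPU-based OpusMT (Tesla P4, over internet)   ~1.3 s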

Out of interest, what CPU does your personal laptop use? (Ideally /proc/cpuinfo if you're on Linux).

It appears to me that the next step would be to compare the OpenBLAS test results from stat1008 with an MKL test install on stat1008? Do you need anything else installed other than intel-mkl-64bit-2019.5-075 for this?

NOTE: I have documented the steps for doing this installation and performance test at https://wikitech.wikimedia.org/wiki/User:Santhosh/OpusMT_Setup

> Out of interest, what CPU does your personal laptop use? (Ideally /proc/cpuinfo if you're on Linux).

It is an X1 Carbon Gen 5 laptop. CPU (4 cores):

vendor_id       : GenuineIntel
cpu family      : 6
model           : 142
model name      : Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
stepping        : 9
microcode       : 0xca
cpu MHz         : 804.678
cache size      : 4096 KB

> It appears to me that the next step would be to compare the OpenBLAS test results from stat1008 with an MKL test install on stat1008? Do you need anything else installed other than intel-mkl-64bit-2019.5-075 for this?

Well, we can't install MKL as it is proprietary, right?

There is active parallel work going on upstream at https://github.com/marian-nmt/marian-dev/pull/595, based on https://github.com/kpu/intgemm/.

Hi @santhosh,
I have thought about the project and I have a question for you: have you tested that multiple models can be loaded and queried in parallel on a GPU?
That question relates to my understanding of the scope: since the goal is to serve many models (plenty of language pairs), running multiple models on a single GPU seems unavoidable.
Thanks :)

> Hi @santhosh,
> I have thought about the project and I have a question for you: have you tested that multiple models can be loaded and queried in parallel on a GPU?

Yes, this has been tested and works fine, but the performance is not independent of the models we load. Also, each model requires a lot of RAM, so there is a limit.

I am going to close this ticket, as we did several tests on different configurations and found that performance is slow if MarianNMT is not using Intel MKL. We also found that the GPU performs better. The decisions based on these evaluations are yet to happen; I have no updates on that.

Nuria moved this task from Event Platform to Mentoring on the Analytics board.
ssastry mentioned this in Unknown Object (Task). Oct 18 2020, 7:47 PM