
Set up the web service that serves dumps.wikimedia.org
Closed, Resolved · Public

Description

  • Puppetize the webservice
  • Load test it to determine QoS
  • Determine failover mechanism

Event Timeline

madhuvishy created this task.

Change 419522 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Setup web server config in distribution hosts

https://gerrit.wikimedia.org/r/419522

Change 419522 merged by Madhuvishy:
[operations/puppet@production] dumps: Setup web server config in distribution hosts

https://gerrit.wikimedia.org/r/419522

Change 419590 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Set xmldumps server as localhost for testing

https://gerrit.wikimedia.org/r/419590

Change 419590 merged by Madhuvishy:
[operations/puppet@production] dumps: Set xmldumps server as localhost for testing

https://gerrit.wikimedia.org/r/419590

To fail over the web service between the two labstores:

  1. Switch do_acme to true for the intended primary server, and false for the backup in hieradata/hosts/<hostname>.yaml
  2. Switch the dumps_dist_active_web setting to the intended primary server in https://github.com/wikimedia/puppet/blob/production/hieradata/common.yaml
  3. Switch the CNAME for dumps to the intended primary server in the dns repo -- dns/template/wikimedia.org
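
A rough sketch of those three switches, with <primary> and <backup> as placeholders for the two labstore hostnames (the exact hiera key paths and DNS record syntax may differ slightly in the real repos):

  # 1. Per-host hiera: flip the cert/ACME flag
  #    hieradata/hosts/<primary>.yaml  ->  do_acme: true
  #    hieradata/hosts/<backup>.yaml   ->  do_acme: false
  # 2. Common hiera: point the active web host at the new primary
  #    hieradata/common.yaml           ->  dumps_dist_active_web: <primary hostname>
  # 3. DNS: repoint the dumps CNAME in the wikimedia.org template
  #    dumps  IN CNAME  <primary>.wikimedia.org.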

Running some load/performance tests. All tests were run from my local machine.

Client specs:

MacBook Pro | 2.9 GHz Intel Core i5 (4 cores) | 8 GB 1867 MHz DDR3

Internet connection speeds:
64.7 Mbps download
92.2 Mbps upload
Latency: 15 ms
Speed test server: San Francisco Bay Area, CA

Server under test: labstore1007, with nginx connection limiting and rate limiting turned off:

How many connections can we serve simultaneously without returning connection errors or timeouts?

Test strategy: Keep incrementing concurrent connections until we start seeing timeouts/connection errors
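
In practice this just means re-running wrk with a different -c value each time. A hypothetical sketch of the sequence used below (same flags as the actual runs):

  for conns in 1000 1500 2000 1800 1700; do
    wrk -H "User-Agent: wrk madhu test" -c "$conns" -t 8 -d 3m https://208.80.155.106/analytics
  done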

# Test case 1: 1000 concurrent connections over 3 minutes

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 1000 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   111.80ms    4.58ms 468.86ms   87.72%
    Req/Sec     1.12k    86.31     1.32k    72.14%
  1604433 requests in 3.00m, 722.21MB read
Requests/sec:   8909.77
Transfer/sec:      4.01MB

Load increased up to 1.3.

# Test case 2: 1500 concurrent connections over 3 minutes

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 1500 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 1500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   154.72ms   11.71ms   1.11s    98.16%
    Req/Sec     1.21k   222.89     1.89k    74.86%
  1733802 requests in 3.00m, 780.44MB read
Requests/sec:   9627.13
Transfer/sec:      4.33MB

Load increased up to 1.9

# Test case 3: 2000 concurrent connections over 3 minutes

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 2000 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   250.45ms  215.74ms   1.99s    88.26%
    Req/Sec     1.12k   168.17     2.21k    74.41%
  1599687 requests in 3.00m, 720.07MB read
  Socket errors: connect 0, read 21, write 0, timeout 3022
Requests/sec:   8882.12
Transfer/sec:      4.00MB

Load increased up to 1.6. We start seeing timeouts here. Dialing back down a bit.

# Test case 4: 1800 concurrent connections over 3 minutes

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 1800 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 1800 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   218.95ms  158.58ms   1.98s    92.05%
    Req/Sec     1.14k   159.82     2.23k    74.76%
  1630121 requests in 3.00m, 733.77MB read
  Socket errors: connect 0, read 0, write 0, timeout 511
Requests/sec:   9052.13
Transfer/sec:      4.07MB

Load increased up to 2. We are still seeing a few timeouts here.

# Test case 5: 1700 concurrent connections over 3 minutes

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 1700 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 1700 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   190.97ms   96.28ms   1.91s    94.27%
    Req/Sec     1.17k   140.41     1.88k    71.20%
  1672215 requests in 3.00m, 752.72MB read
  Socket errors: connect 0, read 0, write 0, timeout 48
Requests/sec:   9284.69
Transfer/sec:      4.18MB

Load spiked up to 1.5. Still seeing a few timeouts, but this seems to be in the ballpark of how many concurrent connections we can serve.

How does the server perform under expected load?

Based on past data for dataset1001 (looking at the median of the last 1200 datapoints), normal load appears to be about 85-100 concurrent connections over long periods of time. Since the number of connections is highly variable, it can spike up to 1500 connections for short windows (potentially a few hours around times when new datasets show up?).
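
For reference, the median above is just a back-of-the-envelope number; assuming the concurrent-connection datapoints were exported to a hypothetical one-value-per-line file connections.txt, it amounts to something like:

  # median of the last 1200 datapoints
  tail -n 1200 connections.txt | sort -n | \
    awk '{ a[NR] = $1 } END { print (NR % 2 ? a[(NR+1)/2] : (a[NR/2] + a[NR/2+1]) / 2) }'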

# Test case 1: 1 hour test 100 concurrent connections

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 100 -t 8 -d 60m https://208.80.155.106
Running 60m test @ https://208.80.155.106
  8 threads and 100 connections

  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   107.64ms   37.18ms   1.99s    99.42%
    Req/Sec   111.25     11.14   145.00     85.53%
  3186792 requests in 60.00m, 13.20GB read
  Socket errors: connect 0, read 0, write 0, timeout 81
Requests/sec:    885.16
Transfer/sec:      3.75MB

A few timeouts; load < 0.7 at peak.

# Test case 2: 15 minute test 1500 concurrent connections

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 1500 -t 8 -d 15m https://208.80.155.106
Running 15m test @ https://208.80.155.106
  8 threads and 1500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   426.39ms  373.56ms   2.00s    85.47%
    Req/Sec   321.15     60.41     0.88k    70.61%
  2297363 requests in 15.00m, 9.51GB read
  Socket errors: connect 0, read 539, write 0, timeout 69733
Requests/sec:   2552.30
Transfer/sec:     10.82MB

A lot more timeouts here - about 3% of total requests. Load < 1 at peak.

How does the server perform under higher than normal load?

# Test case: 1 and 3 minute tests for 2000, 4000, 8000, 20000, and 40000 concurrent connections

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 2000 -t 8 -d 1m https://208.80.155.106/analytics
Running 1m test @ https://208.80.155.106/analytics
  8 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   264.30ms  227.17ms   1.99s    86.91%
    Req/Sec     1.06k   169.88     2.47k    78.06%
  501294 requests in 1.00m, 225.65MB read
  Socket errors: connect 0, read 0, write 0, timeout 1236
Requests/sec:   8341.20
Transfer/sec:      3.75MB

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 4000 -t 8 -d 1m https://208.80.155.106/analytics
Running 1m test @ https://208.80.155.106/analytics
  8 threads and 4000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   311.73ms  307.82ms   2.00s    88.52%
    Req/Sec     0.97k   196.04     2.39k    78.55%
  455785 requests in 1.00m, 205.16MB read
  Socket errors: connect 0, read 0, write 0, timeout 10631
Requests/sec:   7587.21
Transfer/sec:      3.42MB

Load still only about 1.2 max

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 8000 -t 8 -d 1m https://208.80.155.106/analytics
Running 1m test @ https://208.80.155.106/analytics
  8 threads and 8000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   325.71ms  321.13ms   2.00s    89.41%
    Req/Sec     0.93k   257.62     1.86k    72.08%
  430553 requests in 1.00m, 193.81MB read
  Socket errors: connect 871, read 681, write 0, timeout 16108
Requests/sec:   7165.56
Transfer/sec:      3.23MB

Load still only about 1.5 max

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 20000 -t 8 -d 1m https://208.80.155.106/analytics
Running 1m test @ https://208.80.155.106/analytics
  8 threads and 20000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   318.01ms  311.96ms   2.00s    89.95%
    Req/Sec     0.95k   287.37     2.76k    73.55%
  443877 requests in 1.00m, 199.80MB read
  Socket errors: connect 12843, read 59, write 0, timeout 13855
Requests/sec:   7388.09
Transfer/sec:      3.33MB

Load still holding steady at ~1.5.

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 20000 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 20000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   311.49ms  310.93ms   2.00s    90.16%
    Req/Sec     0.98k   234.01     2.90k    73.17%
  1370258 requests in 3.00m, 616.80MB read
  Socket errors: connect 13139, read 7381, write 0, timeout 54474
Requests/sec:   7608.67
Transfer/sec:      3.42MB

Peak load < 1.2. 

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 40000 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 40000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   313.46ms  310.63ms   2.00s    90.11%
    Req/Sec     1.00k   299.53     3.64k    74.38%
  1386102 requests in 3.00m, 623.93MB read
  Socket errors: connect 36287, read 10500, write 0, timeout 52776
Requests/sec:   7695.42
Transfer/sec:      3.46MB

Peak load < 1.3

What happens when we download large files?

# Test case: 30 minute test, 100 concurrent connections, larger file

☁  ~  wrk -H "User-Agent: wrk madhu test" -c 100 -t 8 -d 30m https://208.80.155.106/enwiki/20180320/enwiki-20180320-pages-articles-multistream-index.txt.bz2
Running 30m test @ https://208.80.155.106/enwiki/20180320/enwiki-20180320-pages-articles-multistream-index.txt.bz2
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us     nan%
    Req/Sec     0.17      1.29    10.00     98.33%
  60 requests in 30.00m, 19.27GB read
  Socket errors: connect 0, read 0, write 0, timeout 60
Requests/sec:      0.03
Transfer/sec:     10.96MB

We can ignore the timeouts since these downloads are too large to complete within wrk's timeout window. The server's load stayed under 1.3, which is pretty good!

{F16578488}

Network usage is also pretty low considering we have a 10 gig interface.

Change 423362 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] Revert "dumps: Set xmldumps server as localhost for testing"

https://gerrit.wikimedia.org/r/423362

Change 423362 merged by Madhuvishy:
[operations/puppet@production] Revert "dumps: Set xmldumps server as localhost for testing"

https://gerrit.wikimedia.org/r/423362