Description
- Puppetize the webservice
- Load test it to determine QoS
- Determine failover mechanism
Details
Status | Assigned | Task
---|---|---
Resolved | bd808 | T166402 Program 7 Outcome 3: data services
Resolved | ArielGlenn | T182540 get datset1001, ms1001 ready for decommission
Resolved | madhuvishy | T168486 Migrate customer-facing Dumps endpoints to Cloud Services
Resolved | madhuvishy | T188641 Set up the web service that serves dumps.wikimedia.org
Event Timeline
Change 419522 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Setup web server config in distribution hosts
Change 419522 merged by Madhuvishy:
[operations/puppet@production] dumps: Setup web server config in distribution hosts
Change 419590 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] dumps: Set xmldumps server as localhost for testing
Change 419590 merged by Madhuvishy:
[operations/puppet@production] dumps: Set xmldumps server as localhost for testing
To fail over the webservice between the two labstores:
- Set do_acme to true for the intended primary server and to false for the backup, in hieradata/hosts/<hostname>.yaml
- Switch the dumps_dist_active_web setting in https://github.com/wikimedia/puppet/blob/production/hieradata/common.yaml
- Switch the CNAME for dumps to the intended primary server in the dns repo (dns/template/wikimedia.org)
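The three steps above can be sketched as a small checklist generator. This is only an illustration: `failover_plan` and the host names in the usage line are hypothetical, and the value that `dumps_dist_active_web` expects (short hostname vs. FQDN) is an assumption, so check the existing entry in common.yaml before editing.

```python
from __future__ import annotations


def failover_plan(new_primary: str, new_backup: str) -> list[str]:
    """Return the manual edits for a webservice failover, in the order listed above.

    Hypothetical helper: it only formats a checklist and changes nothing.
    """
    if new_primary == new_backup:
        raise ValueError("primary and backup must be different hosts")
    return [
        # Step 1: only the intended primary should have do_acme enabled.
        f"hieradata/hosts/{new_primary}.yaml: do_acme: true",
        f"hieradata/hosts/{new_backup}.yaml: do_acme: false",
        # Step 2: the exact value format of this key is an assumption.
        f"hieradata/common.yaml: dumps_dist_active_web -> {new_primary}",
        # Step 3: DNS change lives in a separate repo.
        f"dns repo, dns/template/wikimedia.org: dumps CNAME -> {new_primary}",
    ]


if __name__ == "__main__":
    # Placeholder host names, not real servers.
    for step in failover_plan("primary-host", "backup-host"):
        print("-", step)
```

Printing the plan before touching the repos makes it easier to review that the puppet and DNS changes point at the same host.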
Running some load/performance tests. All tests from local machine.
Client specs:
MacBook Pro | 2.9 GHz Intel Core i5 (4 cores) | 8 GB 1867 MHz DDR3
Internet connection speeds: 64.7 Mbps download, 92.2 Mbps upload
Latency: 15 ms
Server: San Francisco Bay Area, CA
Server: labstore1007 with nginx connection and rate limiting turned off:
How many connections can we serve simultaneously without returning connection errors or timeouts?
Test strategy: Keep incrementing concurrent connections until we start seeing timeouts/connection errors
# Test case 1: 1000 concurrent connections over 3 minutes
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 1000 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   111.80ms    4.58ms 468.86ms   87.72%
    Req/Sec     1.12k    86.31     1.32k    72.14%
  1604433 requests in 3.00m, 722.21MB read
Requests/sec:   8909.77
Transfer/sec:      4.01MB

Load increased up to 1.3.

# Test case 2: 1500 concurrent connections over 3 minutes
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 1500 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 1500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   154.72ms   11.71ms   1.11s    98.16%
    Req/Sec     1.21k   222.89     1.89k    74.86%
  1733802 requests in 3.00m, 780.44MB read
Requests/sec:   9627.13
Transfer/sec:      4.33MB

Load increased up to 1.9.

# Test case 3: 2000 concurrent connections over 3 minutes
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 2000 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   250.45ms  215.74ms   1.99s    88.26%
    Req/Sec     1.12k   168.17     2.21k    74.41%
  1599687 requests in 3.00m, 720.07MB read
  Socket errors: connect 0, read 21, write 0, timeout 3022
Requests/sec:   8882.12
Transfer/sec:      4.00MB

Load increased up to 1.6. We start seeing timeouts here. Dialing back down a bit.

# Test case 4: 1800 concurrent connections over 3 minutes
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 1800 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 1800 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   218.95ms  158.58ms   1.98s    92.05%
    Req/Sec     1.14k   159.82     2.23k    74.76%
  1630121 requests in 3.00m, 733.77MB read
  Socket errors: connect 0, read 0, write 0, timeout 511
Requests/sec:   9052.13
Transfer/sec:      4.07MB

Load increased up to 2. We are still seeing a few timeouts here.

# Test case 5: 1700 concurrent connections over 3 minutes
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 1700 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 1700 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   190.97ms   96.28ms   1.91s    94.27%
    Req/Sec     1.17k   140.41     1.88k    71.20%
  1672215 requests in 3.00m, 752.72MB read
  Socket errors: connect 0, read 0, write 0, timeout 48
Requests/sec:   9284.69
Transfer/sec:      4.18MB

Load spikes up to 1.5. Still seeing a few timeouts, but this seems to be in the ballpark of the number of concurrent connections we can serve.
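To compare the five runs above side by side, the timeout counts can be put in relation to the request totals. The sketch below is only arithmetic on the numbers wrk reported. Runs without a "Socket errors" line are taken as zero timeouts, and dividing timeouts by the request total is an approximation, since the total counts completed requests. The function names are made up for this example.

```python
# (concurrent connections, total requests, timeouts), copied from the five runs above.
# Test cases 1 and 2 printed no "Socket errors" line, so their timeouts are taken as 0.
RUNS = [
    (1000, 1604433, 0),
    (1500, 1733802, 0),
    (1700, 1672215, 48),
    (1800, 1630121, 511),
    (2000, 1599687, 3022),
]


def timeout_rate(requests: int, timeouts: int) -> float:
    """Timeouts as a fraction of the reported request total."""
    return timeouts / requests


def highest_clean_level(runs):
    """Largest tested connection count that finished with zero timeouts."""
    clean = [conns for conns, _, timeouts in runs if timeouts == 0]
    return max(clean) if clean else None


if __name__ == "__main__":
    for conns, requests, timeouts in RUNS:
        pct = 100 * timeout_rate(requests, timeouts)
        print(f"{conns:>5} connections: {timeouts:>5} timeouts ({pct:.4f}% of requests)")
    print("highest tested level with no timeouts:", highest_clean_level(RUNS))
```

Even at 2000 connections the timeouts stay well under 1% of requests, which fits the reading that the limit sits somewhere between 1500 and 1700 connections.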
How does the server perform under expected load?
Based on past data for dataset1001 (the median of the last 1200 datapoints), normal load seems to be about 85-100 concurrent connections over long periods of time. The number of connections is highly variable, though, and can spike up to 1500 connections for short windows (potentially a few hours around the times when new datasets show up?).
# Test case 1: 1 hour test, 100 concurrent connections
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 100 -t 8 -d 60m https://208.80.155.106
Running 60m test @ https://208.80.155.106
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   107.64ms   37.18ms   1.99s    99.42%
    Req/Sec   111.25     11.14   145.00     85.53%
  3186792 requests in 60.00m, 13.20GB read
  Socket errors: connect 0, read 0, write 0, timeout 81
Requests/sec:    885.16
Transfer/sec:      3.75MB

Few timeouts, load <0.7 at peak.

# Test case 2: 15 minute test, 1500 concurrent connections
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 1500 -t 8 -d 15m https://208.80.155.106
Running 15m test @ https://208.80.155.106
  8 threads and 1500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   426.39ms  373.56ms   2.00s    85.47%
    Req/Sec   321.15     60.41     0.88k    70.61%
  2297363 requests in 15.00m, 9.51GB read
  Socket errors: connect 0, read 539, write 0, timeout 69733
Requests/sec:   2552.30
Transfer/sec:     10.82MB

Many more timeouts here: about 3% of total requests. Load <1 at peak.
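The "about 3%" figure can be reproduced directly from the wrk summary. Below is a minimal sketch that pulls the request and socket error counts out of the summary text with regular expressions. `parse_wrk_summary` is a name invented here, and the patterns are written against the output format shown in this thread, not against any documented wrk interface.

```python
import re


def parse_wrk_summary(text: str) -> dict:
    """Extract the request total and socket error counts from wrk summary text.

    Error counts default to 0 when there is no "Socket errors" line.
    """
    requests = re.search(r"(\d+) requests in", text)
    if requests is None:
        raise ValueError("no 'N requests in ...' line found")
    counts = {"requests": int(requests.group(1)),
              "connect": 0, "read": 0, "write": 0, "timeout": 0}
    errors = re.search(
        r"Socket errors: connect (\d+), read (\d+), write (\d+), timeout (\d+)", text)
    if errors is not None:
        for key, value in zip(("connect", "read", "write", "timeout"), errors.groups()):
            counts[key] = int(value)
    return counts


# Summary lines copied from the 15 minute, 1500 connection run.
SUMMARY_1500 = """\
  2297363 requests in 15.00m, 9.51GB read
  Socket errors: connect 0, read 539, write 0, timeout 69733
"""

if __name__ == "__main__":
    counts = parse_wrk_summary(SUMMARY_1500)
    share = counts["timeout"] / counts["requests"]
    print(f"timeouts: {100 * share:.2f}% of reported requests")
```

The same function applied to the 1 hour run gives a timeout share several orders of magnitude smaller, which matches the "few timeouts" note.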
How does the server perform under higher than normal load?
# Test case: 1 minute tests for 2000, 4000, 8000 and 20000 concurrent connections, then 3 minute tests for 20000 and 40000
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 2000 -t 8 -d 1m https://208.80.155.106/analytics
Running 1m test @ https://208.80.155.106/analytics
  8 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   264.30ms  227.17ms   1.99s    86.91%
    Req/Sec     1.06k   169.88     2.47k    78.06%
  501294 requests in 1.00m, 225.65MB read
  Socket errors: connect 0, read 0, write 0, timeout 1236
Requests/sec:   8341.20
Transfer/sec:      3.75MB

☁ ~ wrk -H "User-Agent: wrk madhu test" -c 4000 -t 8 -d 1m https://208.80.155.106/analytics
Running 1m test @ https://208.80.155.106/analytics
  8 threads and 4000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   311.73ms  307.82ms   2.00s    88.52%
    Req/Sec     0.97k   196.04     2.39k    78.55%
  455785 requests in 1.00m, 205.16MB read
  Socket errors: connect 0, read 0, write 0, timeout 10631
Requests/sec:   7587.21
Transfer/sec:      3.42MB

Load still only about 1.2 max.

☁ ~ wrk -H "User-Agent: wrk madhu test" -c 8000 -t 8 -d 1m https://208.80.155.106/analytics
Running 1m test @ https://208.80.155.106/analytics
  8 threads and 8000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   325.71ms  321.13ms   2.00s    89.41%
    Req/Sec     0.93k   257.62     1.86k    72.08%
  430553 requests in 1.00m, 193.81MB read
  Socket errors: connect 871, read 681, write 0, timeout 16108
Requests/sec:   7165.56
Transfer/sec:      3.23MB

Load still only about 1.5 max.

☁ ~ wrk -H "User-Agent: wrk madhu test" -c 20000 -t 8 -d 1m https://208.80.155.106/analytics
Running 1m test @ https://208.80.155.106/analytics
  8 threads and 20000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   318.01ms  311.96ms   2.00s    89.95%
    Req/Sec     0.95k   287.37     2.76k    73.55%
  443877 requests in 1.00m, 199.80MB read
  Socket errors: connect 12843, read 59, write 0, timeout 13855
Requests/sec:   7388.09
Transfer/sec:      3.33MB

Load still holding good at ~1.5.

☁ ~ wrk -H "User-Agent: wrk madhu test" -c 20000 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 20000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   311.49ms  310.93ms   2.00s    90.16%
    Req/Sec     0.98k   234.01     2.90k    73.17%
  1370258 requests in 3.00m, 616.80MB read
  Socket errors: connect 13139, read 7381, write 0, timeout 54474
Requests/sec:   7608.67
Transfer/sec:      3.42MB

Peak load < 1.2.

☁ ~ wrk -H "User-Agent: wrk madhu test" -c 40000 -t 8 -d 3m https://208.80.155.106/analytics
Running 3m test @ https://208.80.155.106/analytics
  8 threads and 40000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   313.46ms  310.63ms   2.00s    90.11%
    Req/Sec     1.00k   299.53     3.64k    74.38%
  1386102 requests in 3.00m, 623.93MB read
  Socket errors: connect 36287, read 10500, write 0, timeout 52776
Requests/sec:   7695.42
Transfer/sec:      3.46MB

Peak load < 1.3.
What happens when we download large files?
# Test case: 30 minute test, 100 concurrent connections, larger file
☁ ~ wrk -H "User-Agent: wrk madhu test" -c 100 -t 8 -d 30m https://208.80.155.106/enwiki/20180320/enwiki-20180320-pages-articles-multistream-index.txt.bz2
Running 30m test @ https://208.80.155.106/enwiki/20180320/enwiki-20180320-pages-articles-multistream-index.txt.bz2
  8 threads and 100 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.00us    0.00us   0.00us     nan%
    Req/Sec     0.17      1.29    10.00     98.33%
  60 requests in 30.00m, 19.27GB read
  Socket errors: connect 0, read 0, write 0, timeout 60
Requests/sec:      0.03
Transfer/sec:     10.96MB

We can ignore the timeouts, since these downloads are too large to complete within wrk's request timeout. The server's load is still under 1.3, which is pretty good!

{F16578488}

Network usage is also pretty low considering we have a 10 gig interface.
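For a rough sense of how low the network usage is, the reported transfer rate can be converted to a share of the 10 gig link. This sketch assumes wrk's "MB" means 2^20 bytes and that the interface is 10 Gbit/s in decimal units. With decimal megabytes the result shifts by under 5%, so the conclusion is the same either way.

```python
def link_utilization(transfer_mib_per_s: float, link_gbit_per_s: float) -> float:
    """Fraction of the link used, assuming the transfer rate is in MiB/s (2**20 bytes)."""
    bits_per_s = transfer_mib_per_s * 1024 * 1024 * 8
    return bits_per_s / (link_gbit_per_s * 1e9)


if __name__ == "__main__":
    observed = 10.96  # Transfer/sec reported by wrk for the large-file test
    mbit_per_s = observed * 1024 * 1024 * 8 / 1e6
    share = link_utilization(observed, 10.0)
    print(f"{observed} MB/s is roughly {mbit_per_s:.0f} Mbit/s, "
          f"or {100 * share:.1f}% of a 10 Gbit/s link")
```

The result is under 1% of the interface. It is also in the same range as the connection speeds listed in the client specs, so this run may say more about the client link than about the server.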
Change 423362 had a related patch set uploaded (by Madhuvishy; owner: Madhuvishy):
[operations/puppet@production] Revert "dumps: Set xmldumps server as localhost for testing"
Change 423362 merged by Madhuvishy:
[operations/puppet@production] Revert "dumps: Set xmldumps server as localhost for testing"