
Performance test the service
Closed, Resolved · Public · 8 Estimated Story Points

Description

Update 2018/01/02: Findings

See T178278#3852240 for @pmiazga's initial report.


We now know that we're looking at ~1 month to get the headless Chromium based render service deployed. It might take longer to get it undeployed if it isn't a viable replacement for the Electron-based render service. If we have an understanding of how the new service performs ahead of time, then we might be able to save folk (including us) a lot of time.

Open Questions

  1. What?

From the parent task:

We should be providing median and 95/99th percentile timings for the service rendering S/M/L/XL/XL… pages and resource consumption on the server during those test runs. Moreover, these tests should be repeatable (i.e. there's a script that can be run by anyone [and at different times!]).

We all acknowledge that we'd have to take the results with a grain of salt as we'd be using a VPS-hosted instance but they'll be helpful in the interim while we're trying to get the service into production.

It might be worth focusing more on robustness than simple-page latency, as that is the more critical issue with Electron. Previously, I tested with a few very large articles (see T142226#2537844). This tested timeout enforcement. Testing with a simulated overload (many concurrent requests for huge pages) could also be useful to ensure that concurrency limits and resource usage limits are thoroughly enforced.
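As a starting point, here is a minimal sketch of how the percentile timings could be collected for a single article; the render URL is a placeholder and would need to be replaced with the service's actual route for the article under test:

    # Time N sequential renders of one article and report the median/95th/99th
    # percentile response times.
    URL="http://chromium-pdf.wmflabs.org/RENDER_PATH_FOR_ARTICLE"   # placeholder
    N=100
    for i in $(seq 1 "$N"); do
        curl -s -o /dev/null -w '%{time_total}\n' "$URL"
    done | sort -n > timings.txt
    awk '{ t[NR] = $1 } END {
        printf "median %.3fs  p95 %.3fs  p99 %.3fs\n",
               t[int(NR*0.50)], t[int(NR*0.95)], t[int(NR*0.99)]
    }' timings.txt

Running the same loop against one article from each size bucket (S/M/L/XL) and at different times would make the results repeatable in the sense described above.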

  2. Where?

Performance testing on a VPS isn't ideal but it's cheaper than testing on production hardware (the time cost of deploying the service is ~1 month). @bmansurov has already set up a VPS to test the service, which is accessible here: http://chromium-pdf.wmflabs.org.

Per T178278#3726445, we also have access to bare metal with a big pipe. @pmiazga would be required to set up the service on that server.

List of test articles

long articles

https://en.wikipedia.org/wiki/List_of_members_of_the_Lok_Sabha_(1952%E2%80%93present) - long list article

https://en.wikipedia.org/wiki/Battle_of_Mosul_(2016%E2%80%932017) - long article, lots of images

https://en.wikipedia.org/wiki/Hendrick_Motorsports - long article, tables, images

https://en.wikipedia.org/wiki/Panama_Papers - long article, images

https://en.wikipedia.org/wiki/List_of_compositions_by_Franz_Schubert - long list article

top 5 printed articles

https://en.wikipedia.org/wiki/Mahatma_Gandhi

https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam

https://en.wikipedia.org/wiki/Vijayadashami

https://hi.wikipedia.org/wiki/%E0%A4%AE%E0%A4%B9%E0%A4%BE%E0%A4%A4%E0%A5%8D%E0%A4%AE%E0%A4%BE_%E0%A4%97%E0%A4%BE%E0%A4%82%E0%A4%A7%E0%A5%80

https://en.wikipedia.org/wiki/Halloween

long right to left articles

https://ar.wikipedia.org/wiki/%D9%85%D8%B5%D8%B1

https://he.wikipedia.org/wiki/%D7%AA%D7%9C_%D7%90%D7%91%D7%99%D7%91-%D7%99%D7%A4%D7%95

stubs

https://en.wikipedia.org/wiki/Berenice_Mu%C3%B1oz

https://en.wikipedia.org/wiki/Benita_Mehra

Things to double check

In T178501#3767355, @pmiazga found that some rendered PDFs are incomplete. This happens only when the queue is full and the service is constantly handling at least 2-3 concurrent requests. During performance testing, please verify that all PDF files are rendered correctly and contain all pages.

A/C

  • Generate PDFs using the list of articles above. Verify that the PDF contents look good, i.e. the PDFs contain actual article text, and not some error message from RESTBase or somewhere else. Upload the resulting PDFs here.

See T178278#3854820 onwards.

  • Measure and report times spent rendering articles in succession. The service logs contain this information.
  • Use siege to report the performance of the service. Create various scenarios that combine articles from the list above while controlling the number of concurrent requests (see the sketch after this list). Analyze the service logs to make sure that nothing unexpected happens. For example, if siege's concurrency is set to 5 and the service's is set to 3, then make sure that once 3 requests are being rendered, the next two are aborted immediately.
  • Monitor the system load while doing these tests. See T175853#3616304 for reference.
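A minimal siege sketch for one such scenario (the URL-file and log names are placeholders):

    # 5 concurrent clients, 50 repetitions each, no think time between requests
    siege -b -c 5 -r 50 -f mixed_articles.txt --log=siege_c5.log
    # Repeat with -c 3, -c 7, -c 10, ... and compare the results against the
    # service's configured concurrency limit in its logs.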

Event Timeline


A few extra things we should take into consideration:

  • Mobile PDFs and rendering mobile pdfs - traffic, delivering separate print styles, etc
  • Books rendering - are there any other variables we're concerned with for books outside of length?

A few extra things we should take into consideration:

  • Mobile PDFs and rendering mobile pdfs - traffic, delivering separate print styles, etc

I think it's important that we test this, but would we call it a performance test? Isn't this task specifically about testing if the service can handle heavy traffic?

  • Books rendering - are there any other variables we're concerned with for books outside of length?

As it currently stands, the service is configured to handle single articles (timeouts are not big enough for books to render correctly). I wonder if this should be the next step only if the performance test of single articles passes.

Also we'll need a list of articles to test against. Do we have one?

A few extra things we should take into consideration:

  • Mobile PDFs and rendering mobile pdfs - traffic, delivering separate print styles, etc

I think it's important that we test this, but would we call it a performance test? Isn't this task specifically about testing if the service can handle heavy traffic?

This would be about heavy traffic and an extra check that we can support two different styles based on the browser (I think the latter might require its own task)

  • Books rendering - are there any other variables we're concerned with for books outside of length?

As it currently stands, the service is configured to handle single articles (timeouts are not big enough for books to render correctly). I wonder if this should be the next step only if the performance test of single articles passes.

Sounds good

Also we'll need a list of articles to test against. Do we have one?

I can put together a list. My thoughts would be:

  • 5-6 very long articles
  • 5-6 articles with a lot of images/infoboxes
  • 5-6 very short articles
  • the top 5-6 printed articles

OK, so it looks like the plan is to create two more tasks: (1) test the appearance of articles with and without mobile styles, (2) test how the server handles books (which should be done only after the service is ready for testing).

In addition to the article types you listed, I'd throw in:

  • articles in RTL languages
  • a minimalistic article that only has a couple of words (or no content) so that we can measure the base case
  • Look at current electron logs and identify the most frequently printed articles, and test them too.

OK, so it looks like the plan is to create two more tasks: (1) test the appearance of articles with and without mobile styles

I think we would want to test three things:

  • whether the service can simultaneously produce both desktop and mobile styles
  • whether chromium can support and create PDFs with mobile styles (not sure if we need to do this as we're currently serving the mobile styles on google chrome, but it's probably better to double-check)
  • the performance of the service when mobile styles are used (with the set of articles above)

Does that sound right? I can set up the task.

We also need to analyze access logs from electron service and try to replay the real traffic. Synthetic tests are not the best pick when it comes to performance testing.

@ovasileva yes, that sounds right.
@pmiazga, do you think that testing the service with real traffic is part of this task, or is it something we need to do after working on this task?

IMHO this is a part of this task, as we want to verify that the newly created service is good enough to handle the production traffic.

Also, would we be able to test books using the current concatenation code prior to deploying the service (but after this initial round of testing)?

I've left a comment on T181513. As for testing books, yes we need to find a way to either test the service using Extension:Collection or something else. But remember, the service's current goal is to replace Electron as is. We haven't thought about books in this context yet as I mentioned in T178278#3779400.

@ovasileva yes, I've updated the description. That said, we should stay focused and test the performance of the service not other things.

@ovasileva yes, I've updated the description. That said, we should stay focused and test the performance of the service not other things.

Thank you! And agreed - this would be just a quick check - we'll look further in separate tasks if necessary.

@pmiazga @Jdlrobson since the blocker (T181623) is blocked, I think we can move ahead with this task now.

I've updated the chromium-pdf.reading-web-staging.eqiad.wmflabs server with the latest chromium-render code.

Yup, I'll start working on it today.

Also note that pm2 is keeping the service alive.
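For reference, a few pm2 commands that are handy while testing; the entry-point file and process name below are assumptions, not necessarily what is configured on the VPS:

    pm2 start server.js --name chromium-render   # start and keep the service alive
    pm2 status                                    # uptime and restart count
    pm2 logs chromium-render                      # tail the service logs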

Testing round one.

Type: synthetic test, a set of 50 URLs, no pauses. I focused mostly on concurrency/response rates, not on server load, as the chromium renderer was a bit outdated. It was more of a "test round".

Time taken: 12 hours 30 minutes
Concurrency: 5-7 concurrent requests at once
Requests sent: 29738
Success rate: 100%
Error rate: 0%
Throughput: ~ 39 req/min
Data rate: 865.47KB/sec
Stored: 29GB of PDF files

Response times

Average response time: 9.815 sec
Median: 7.193 sec
Fastest response: 1.872 sec

99% of requests took less than 25.241 seconds to finish
95% of requests took less than 21.439 seconds to finish
90% of requests took less than 19.055 seconds to finish

Server load

CPU: 2 threads, 100%
Memory consumption - less than 3.5GB

Testing environment:

Two machines:

  • main device - laptop, i7 + 16GB of RAM, SSD drive (for storing data), using JMeter on a 50 Mbit connection
  • additional load - virtualized server on bare metal, 2x Xeon L5335, 8GB of RAM, ~150 Mbit connection

Other findings:

I used a set of 50 articles, which means the output should consist of roughly 50 distinct files rendered over and over. The system stored 22465 files and there are 1327 distinct file sizes. Some PDFs are broken and some are not rendered completely. I'll analyze the PDF files and submit a separate patch to verify/fix broken PDFs.
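A rough sketch of how the stored files could be checked (the directory name is an assumption); pdfinfo from poppler-utils fails on files it cannot parse, so this is only a first pass and will not catch PDFs that parse but are missing pages:

    # Number of distinct file sizes
    find output -name '*.pdf' -printf '%s\n' | sort -n | uniq | wc -l
    # Flag files that do not parse at all
    find output -name '*.pdf' -print0 | while IFS= read -r -d '' f; do
        pdfinfo "$f" > /dev/null 2>&1 || echo "broken: $f"
    done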

Thanks @pmiazga for this first round. One thing that could also be interesting to measure is how the response latency is influenced by heavy and/or long-running tasks. For example, what is the average response time for smaller/average-sized articles when long-running tasks are run in parallel versus not? Basically, the idea would be to measure whether continuously rendering something like Barack Obama influences the rendering times of other articles.

As for the PDF files themselves, the quality of the files should probably be inspected separately. When it comes to performance testing, accounting for storage is not relevant and can easily slow things down (especially if the system has to write 22k+ files).

@mobrovac I stored the files because I want to verify that they are rendered correctly. The transfer rate is less than 1MB/s and my SSD drive can write up to 250MB/s, so ~29GB over 12 hours is not that much; it should not slow down the load testing overall. I'll definitely have a couple of rounds where I do not store data (a typical performance/benchmark test), but for the regular load IMHO it doesn't hurt.
I like the idea of having a couple of long-running tasks in the background. I'll definitely do that.
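A minimal sketch of such a run (both URLs are placeholders): keep one long article rendering continuously in the background while timing a short article in the foreground, then compare the timings against the idle-state numbers:

    LONG_URL="http://chromium-pdf.wmflabs.org/RENDER_PATH_FOR_LONG_ARTICLE"     # placeholder
    SHORT_URL="http://chromium-pdf.wmflabs.org/RENDER_PATH_FOR_SHORT_ARTICLE"   # placeholder
    while true; do curl -s -o /dev/null "$LONG_URL"; done &
    BG=$!
    for i in $(seq 1 50); do
        curl -s -o /dev/null -w '%{time_total}\n' "$SHORT_URL"
    done > short_with_long_running.txt
    kill "$BG"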

@bd808 - FYI, we're testing the newly created chromium service (http://chromium-pdf.wmflabs.org); don't be alarmed if this service gets flooded with requests or constantly uses all resources (CPU & memory).

@Tbayer: @pmiazga and I had discussed replaying a day's worth of logs from the wmf_raw.webrequest table against the service (obviously, we'd only consider requests to /api/rest_v1/page/pdf/…). In order to do this, @pmiazga would have to copy the logs to his development machine.

My understanding is that that table contains potentially sensitive data and we should ask permission before making a copy of the logs. Is that correct? If so, who do we ask?

We also discussed only exporting the timestamp and the URL, which seems to minimize the amount of sensitive data that we export.

For the next 3-5 days I'll keep sending 5-10 concurrent requests all the time, plus from time to time I'll generate a PDF to verify everything is still correct. I also need to prepare a couple of different testing scenarios (like generating a set of only non-Latin-script languages).

So far the average response time is 10 seconds and the service is able to handle ~28 requests per minute. The service has not collapsed so far; the only issue is that some PDFs are not rendered correctly.

I no longer need to spend the whole day on this task, only around 2-3 hours daily to verify the results (I configured one server to constantly ask the chromium service to render some articles). I'll keep this task in Doing and start working on/reviewing other tasks.

@Tbayer: @pmiazga and I had discussed replaying a day's worth of logs from the wmf_raw.webrequest table against the service (obviously, we'd only consider requests to /api/rest_v1/page/pdf/…). In order to do this, @pmiazga would have to copy the logs to his development machine.

My understanding is that that table contains potentially sensitive data and we should ask permission before making a copy of the logs. Is that correct? If so, who do we ask?

We also discussed only exporting the timestamp and the URL, which seems to minimize the amount of sensitive data that we export.

Sorry, only seeing this now (@pmiazga and I had earlier discussed the use of webrequest data for this purpose on IRC, but not this particular question).

Yes, the webrequest data contains sensitive information, hence it is kept private and is only accessible to people under NDA who in addition have the required user rights. IANAL, but it seems common sense to avoid permanently storing the full logs on a personal development machine. Exporting only timestamp and URL sounds like a great idea.

By the way, any particular reason for using wmf_raw.webrequest instead of the more commonly used wmf.webrequest? (cf. documentation)

By the way, any particular reason for using wmf_raw.webrequest instead of the more commonly used wmf.webrequest? (cf. documentation)

A typo on my part. If we're exporting timestamps and URLs, then wmf.webrequest makes perfect sense.

TL;DR:

The service is performant and stable enough to be put into production. The most performant setting is to set the number of concurrent renders to {CPU_CORES_COUNT} * 2, which offers the best rendering-to-waiting-time ratio. For comparison, when idle the service can generate a short PDF in ~2 secs, while under high load the same request takes about 50% longer. I tested the service for ~2 weeks and transferred over ~200GB of data. Some tuning will be required before the production release (and during the first days after it): setting render_concurrency too high will result in lower performance and rejected tasks.
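For reference, the recommended value boils down to a one-liner (on the 2-core VPS this yields 4); where exactly render_concurrency is set depends on the service configuration and is not shown here:

    RENDER_CONCURRENCY=$(( $(nproc) * 2 ))
    echo "render_concurrency: ${RENDER_CONCURRENCY}"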

The service running on our VPS should handle ~100 000 requests daily (for average article sizes). Please note that we were testing the service against articles, not books. When generating only the top 5 printed articles, the service is able to handle ~58 000 requests daily. Our current traffic is ~60 000 requests daily.

Setup:

The service is available at chromium-pdf.wmflabs.org, a virtualized server. VPS allocation:

  • 2x Intel [ Family:6, Model:61, Stepping:2, Broadwell arch] @ 2.3GHz
  • 4GB of RAM
  • 20GB Drive

The Node service is running and managed via PM2, under the user chromium with limited privileges.

To test I used two tools:

  • siege
  • JMeter (both locally and a remote worker).

Most of the tests were executed on a remote server (Debian 9, no other jobs, ~150 Mbit connection); as an additional load source I used my work laptop (Debian 9, 50 Mbit connection).

Catches:

When the service is misconfigured:

  • if it allows more concurrent requests than the CPUs can handle, the service will start rejecting jobs (removed from / not allowed into the queue)
  • if the amount of memory is not enough to fit all concurrent requests, chromium instances start to fail, which may lead to an unexpected failure of the main node process and a service crash

Other observations:

  • the service could have better performance if given more RAM; the testing system was using somewhere between 75-90% of the available memory
  • when running lots of small jobs, the service renders most of the articles in less than 4-5 seconds and then a couple of the articles in over 9-10 seconds
  • for short articles, setting a higher concurrency usually gives higher throughput (for example, concurrency 20 gives 3.29 transactions per second but the average render time jumps to 5.8 seconds); we can handle more renders, but all users have to wait longer for their PDFs

Resource management:

The server uses ~3GB of RAM in total while load testing. When idle, memory consumption drops to 2.3GB. A rough estimate is ~100MB per chromium instance (one request).
The Chromium service does not perform many I/O operations: up to 400 blocks out while load testing.
While load testing, CPU consumption stays at 100% (load between 7-10; the load usually reflects the number of concurrent requests).
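For completeness, a sketch of how the figures above can be captured during a run (log file names are placeholders):

    vmstat 5                   > vmstat.log   &   # CPU, run queue, blocks in/out
    free -m -s 5               > memory.log   &   # memory consumption every 5s
    pidstat -u -r -C chrome 5  > chromium.log &   # per-process CPU/RSS for chromium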

Stress tests, a.k.a. benchmarks:

The service starts to reject jobs when the queue is full (which is the correct behavior). As an example, a benchmark with up to 100 concurrent requests and no waiting between requests gives ~3% availability; 97% of requests are properly rejected due to the full queue.
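One way to reproduce this overload case with siege (the file name is a placeholder): many concurrent clients, no think time, long articles only, so that most requests hit a full queue:

    siege -b -c 100 -t 10M -f long_articles.txt --log=stress_c100.log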

Load test:

The service was stress tested for approximately a week; the longest run was 72 hours at 100% CPU load (5 concurrent requests, a mix of long and short articles).
The service maintained 100% health and generated PDFs for >99% of requests (the failed ones were rejected by the queue logic, not by system failure, which is the correct behavior).
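One way to reproduce such a soak run with siege (the file name is a placeholder):

    siege -c 5 -t 72H -f mixed_long_and_short_articles.txt --log=soak_c5.log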

Performance tests:

High traffic, TOP 5 Printed articles

Service is able to handle up to ~58 000 requests daily

Availability:                 100.00 %
Response time:                  7.19 secs
Longest transaction:           16.54 secs
Shortest transaction:           3.27 secs
Transaction rate:               0.67 trans/sec
Throughput:                     0.71 MB/sec
Concurrency:                    4.83 jobs

High traffic, short articles

Service is able to handle up to ~245 000 requests daily

Availability:                 100.00 %
Response time:                  3.25 secs
Longest transaction:            8.45 secs
Shortest transaction:           1.67 secs
Transaction rate:               2.86 trans/sec
Throughput:                     0.49 MB/sec
Concurrency:                    9.28 jobs

Average-size articles with lots of math

Service is able to handle up to ~138 000 requests daily

Availability:                 100.00 %
Response time:                  5.98 secs
Longest transaction:           22.15 secs
Shortest transaction:           2.36 secs
Transaction rate:               1.60 trans/sec
Throughput:                     0.63 MB/sec
Concurrency:                    9.59 jobs

High traffic, very long articles (list taken from Special:LongPages + Obama), 7 concurrent requests

The service was not able to handle 10 concurrent requests; the queue started rejecting items.
The service is able to handle up to 7 concurrent requests while still offering high availability (~99%); at 8 concurrent requests it started rejecting jobs very often.
In total, Chromium-pdf can handle up to ~17 000 renders daily under heavy load.

Availability:                  98.87 %
Response time:                 33.57 secs
Longest transaction:           60.23 secs (the timeout)
Shortest transaction:           0.21 secs (the time it takes to return "job rejected")
Transaction rate:               0.20 trans/sec
Throughput:                     0.48 MB/sec
Concurrency:                    6.91 jobs

Very long articles (up to 4 jobs)

Handling 4 concurrent jobs, the service is able to render up to ~18 000 articles daily with a 100% success rate.

Availability:                 100.00 %
Response time:                 18.95 secs
Longest transaction:           35.85 secs
Shortest transaction:           8.79 secs
Transaction rate:               0.21 trans/sec
Throughput:                     0.49 MB/sec
Concurrency:                    3.94 jobs

Long+Short test

These tests were executed to find out how rendering long articles affects many short jobs.
During the tests, the service was handling up to 8 concurrent requests for short articles and 2 concurrent requests for long articles.
In total, the service is able to handle ~150 000 requests daily.

Short Articles
Response time:                  4.43 secs
Transaction rate:               1.71 trans/sec
Longest transaction:           14.70 secs
Shortest transaction:           2.15 secs
Throughput:                     0.29 MB/sec
Concurrency:                    7.56 jobs
Long Articles
Response time:                 20.32 secs
Longest transaction:           36.36 secs
Shortest transaction:           9.18  secs
Transaction rate:               0.10 trans/sec
Throughput:                     0.22 MB/sec
Concurrency:                    1.98 jobs

Light load tests, concurrency equal to the number of CPU cores

These tests were executed to find out how much time it takes to generate a PDF when the service is almost idle.

Very long articles (list taken from Special:LongPages + Obama)
Availability:                 100.00 %
Response time:                 11.65 secs
Longest transaction:           20.12 secs
Shortest transaction:           7.50 secs
Concurrency:                    1.95 jobs

Short articles
Availability:                 100.00 %
Response time:                  2.16 secs
Longest transaction:            3.26 secs
Shortest transaction:           1.28 secs
Concurrency:                    1.79 jobs

@Fjalapeno, @mobrovac: See the above (T178278#3852240) for the results of the performance test. @pmiazga has a little follow-on work to do but it shouldn't affect the recommendation. Would pinging someone from Ops be appropriate here? If so, then who?

Generate PDFs using the list of articles above. Verify that the PDF contents look good, i.e. the PDFs contain actual article text, and not some error message from RESTBase or somewhere else. Upload the resulting PDFs here.

AFAICT @pmiazga generated a lot of PDFs during his various performance test runs. Nevertheless, @pmiazga: Do you have a couple of PDFs lying around?

@phuedx - yes, around 50GB. I'll attach some

phuedx added a subscriber: bmansurov.
phuedx removed a subscriber: bmansurov.
phuedx updated the task description.
phuedx added a subscriber: bmansurov.

^ Not sure what's going on there… Sorry, @bmansurov!

Additionally, I replayed production traffic against the chromium-pdf renderer service. I retrieved all web requests to the PDF endpoint made on 2017-10-04.

Query

SELECT
    ts, uri_host, uri_path
FROM
    wmf.webrequest
WHERE
    webrequest_source = 'text' AND
    agent_type = 'user' AND
    year = 2017 AND
    month = 10 AND
    day = 4 AND
    uri_path LIKE '/api/rest_v1/page/pdf/%'

Query results and findings

The query returned 84445 requests for 57819 unique articles across 240 wikis.
The highest concurrency was 173 print requests sent within one second, but I think the problem lies elsewhere, as 172 of those requests were for the same article.
The most printed article was printed 450 times (on a single day), but again I think the problem lies elsewhere, as 172 of those requests were sent within the same second.
The second highest concurrency was 89, but again, all 89 requests were for the same article within the same second.
The average concurrency was 1.68 requests per second; the median was 1 request per second.

If I remove duplicate requests (the same article requested within the same second), the count drops to 76256 requests (8189 duplicates), the max concurrency drops to 18 requests per second, and the average drops to 1.54 requests per second.

Tests how-to

Using scripts, I transformed the query output into a CSV file containing the full URL to the chromium service for each request, including the wiki and the article. Then I created a JMeter test using an HTTPSampler and a CSVDataSet pointing to that CSV file.
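A rough sketch of that transformation (the chromium-render route is a placeholder that would need to be replaced with the service's actual path; the query output is assumed to be tab-separated with the columns ts, uri_host, uri_path):

    awk -F'\t' '{
        title = $3
        sub("^/api/rest_v1/page/pdf/", "", title)   # keep only the article title
        print "http://chromium-pdf.wmflabs.org/" $2 "/RENDER_PATH/" title
    }' webrequest_export.tsv > urls.csv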

Test results

I ran the JMeter test twice, both times at concurrency 5, which I found to be the most efficient setting during performance testing.

  • First run: the PDF service rendered all articles in 15:20:10 (15 hours, 20 minutes); 2 articles timed out
  • Second run: the PDF service rendered all articles in 14:46:54 (14 hours, 46 minutes); 1 article timed out

Results

It looks like the current instance of Chromium-PDF is able to handle the production traffic. I still have to take a closer look and verify why some articles are requested so often (8 articles got more than 100 print requests within a ~5 second window).
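A quick way to surface those articles from the export (the column layout is assumed to be ts, uri_host, uri_path, tab-separated):

    cut -f2,3 webrequest_export.tsv | sort | uniq -c | sort -rn | head -20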

I should also mention that I did a "live data" test with a set of scripts (a sketch of the per-second runner follows the list):

  • the first script took the list of all requests for a single day and split it into a set of smaller lists, one per second of the day (file name format HHMMSS.csv)
  • a queryChromium script that read the system time, loaded the matching HHMMSS.csv file, and sent simultaneous curl commands in the background (curl $URL > /dev/null &) to retrieve all URLs
  • a job that ran for one day, invoking the queryChromium script every second
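A rough reconstruction of the per-second runner; it assumes each requests/HHMMSS.csv contains one URL per line, which matches the description above but is not taken from the actual script:

    #!/bin/bash
    # queryChromium.sh: fire all requests recorded for the current second.
    now=$(date +%H%M%S)
    [ -f "requests/${now}.csv" ] || exit 0
    while IFS= read -r url; do
        curl -s "$url" > /dev/null &
    done < "requests/${now}.csv"

The one-day job then boils down to invoking this script once per second, e.g. for i in $(seq 1 86400); do ./queryChromium.sh & sleep 1; done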

With that set of tools I was able to emulate the production traffic with very similar timing (1-second granularity). The Chromium-PDF service running on the VPS handled that test well. There were many rejected jobs due to the strange data in the webrequest table (like 70 requests for the same resource, at the same second, with the same UA and IP but different sequence_id). ~10 jobs failed due to render timeouts, but again, that was related to the strange webrequest data.