
Performance test the service
Closed, Resolved · Public · 8 Estimated Story Points

Description

Update 2018/01/02: Findings

See T178278#3852240 for @pmiazga's initial report.


We now know that we're looking at ~1 month to get the headless Chromium based render service deployed. It might take longer to get it undeployed if it isn't a viable replacement for the Electron-based render service. If we have an understanding of how the new service performs ahead of time, then we might be able to save folk (including us) a lot of time.

Open Questions

  1. What?

From the parent task:

We should be providing median and 95/99th percentile timings for the service rendering S/M/L/XL/XL… pages and resource consumption on the server during those test runs. Moreover, these tests should be repeatable (i.e. there's a script that can be run by anyone [and at different times!]).

We all acknowledge that we'd have to take the results with a grain of salt as we'd be using a VPS-hosted instance but they'll be helpful in the interim while we're trying to get the service into production.

It might be worth focusing more on robustness than simple-page latency, as that is the more critical issue with Electron. Previously, I tested with a few very large articles (see T142226#2537844). This tested timeout enforcement. Testing with a simulated overload (many concurrent requests for huge pages) could also be useful to ensure that concurrency limits and resource usage limits are thoroughly enforced.
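As a starting point, here is a minimal sketch of how the percentile timings could be collected for a single article; the render URL is a placeholder and would need to be replaced with the service's actual route for the article under test:

    # Time N sequential renders of one article and report the median/95th/99th
    # percentile response times.
    URL="http://chromium-pdf.wmflabs.org/RENDER_PATH_FOR_ARTICLE"   # placeholder
    N=100
    for i in $(seq 1 "$N"); do
        curl -s -o /dev/null -w '%{time_total}\n' "$URL"
    done | sort -n > timings.txt
    awk '{ t[NR] = $1 } END {
        printf "median %.3fs  p95 %.3fs  p99 %.3fs\n",
               t[int(NR*0.50)], t[int(NR*0.95)], t[int(NR*0.99)]
    }' timings.txt

Running the same loop against one article from each size bucket (S/M/L/XL) and at different times would make the results repeatable in the sense described above.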

  2. Where?

Performance testing on a VPS isn't ideal but it's cheaper than testing on production hardware (the time cost of deploying the service is ~1 month). @bmansurov has already set up a VPS to test the service, which is accessible here: http://chromium-pdf.wmflabs.org.

Per T178278#3726445, we also have access to bare metal with a big pipe. @pmiazga would be required to set up the service on that server.

List of test articles

long articles

https://en.wikipedia.org/wiki/List_of_members_of_the_Lok_Sabha_(1952%E2%80%93present) - long list article

https://en.wikipedia.org/wiki/Battle_of_Mosul_(2016%E2%80%932017) - long article, lots of images

https://en.wikipedia.org/wiki/Hendrick_Motorsports - long article, tables, images

https://en.wikipedia.org/wiki/Panama_Papers - long article, images

https://en.wikipedia.org/wiki/List_of_compositions_by_Franz_Schubert - long list article

top 5 printed articles

https://en.wikipedia.org/wiki/Mahatma_Gandhi

https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam

https://en.wikipedia.org/wiki/Vijayadashami

https://hi.wikipedia.org/wiki/%E0%A4%AE%E0%A4%B9%E0%A4%BE%E0%A4%A4%E0%A5%8D%E0%A4%AE%E0%A4%BE_%E0%A4%97%E0%A4%BE%E0%A4%82%E0%A4%A7%E0%A5%80

https://en.wikipedia.org/wiki/Halloween

long right to left articles

https://ar.wikipedia.org/wiki/%D9%85%D8%B5%D8%B1

https://he.wikipedia.org/wiki/%D7%AA%D7%9C_%D7%90%D7%91%D7%99%D7%91-%D7%99%D7%A4%D7%95

stubs

https://en.wikipedia.org/wiki/Berenice_Mu%C3%B1oz

https://en.wikipedia.org/wiki/Benita_Mehra

Things to double check

In T178501#3767355, @pmiazga found that some rendered PDFs are incomplete. This happens only when the queue is full and the service is constantly handling at least 2-3 concurrent requests. During performance testing, please verify that all PDF files are rendered correctly and contain all pages.

A/C

  • Generate PDFs using the list of articles above. Verify that the PDF contents look good, i.e. the PDFs contain actual article text, and not some error message from RESTBase or somewhere else. Upload the resulting PDFs here.

See T178278#3854820 onwards.

  • Measure and report times spent rendering articles in succession. The service logs contain this information.
  • Use siege to report the performance of the service. Create various scenarios that combine articles from the list above while controlling the number of concurrent requests (see the sketch after this list). Analyze the service logs to make sure that nothing unexpected happens. For example, if siege's concurrency is set to 5 and the service's is set to 3, then make sure that once 3 requests are being rendered, the next two are aborted immediately.
  • Monitor the system load while doing these tests. See T175853#3616304 for reference.
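A minimal siege sketch for one such scenario (the URL-file and log names are placeholders):

    # 5 concurrent clients, 50 repetitions each, no think time between requests
    siege -b -c 5 -r 50 -f mixed_articles.txt --log=siege_c5.log
    # Repeat with -c 3, -c 7, -c 10, ... and compare the results against the
    # service's configured concurrency limit in its logs.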

Event Timeline


A few extra things we should take into consideration:

  • Mobile PDFs and rendering mobile pdfs - traffic, delivering separate print styles, etc
  • Books rendering - are there any other variables we're concerned with for books outside of length?

A few extra things we should take into consideration:

  • Mobile PDFs and rendering mobile pdfs - traffic, delivering separate print styles, etc

I think it's important that we test this, but would we call it a performance test? Isn't this task specifically about testing if the service can handle heavy traffic?

  • Books rendering - are there any other variables we're concerned with for books outside of length?

As it currently stands, the service is configured to handle single articles (timeouts are not big enough for books to render correctly). I wonder if this should be the next step only if the performance test of single articles passes.

Also we'll need a list of articles to test against. Do we have one?

A few extra things we should take into consideration:

  • Mobile PDFs and rendering mobile pdfs - traffic, delivering separate print styles, etc

I think it's important that we test this, but would we call it a performance test? Isn't this task specifically about testing if the service can handle heavy traffic?

This would be about heavy traffic and an extra check that we can support two different styles based on the browser (I think the latter might require its own task)

  • Books rendering - are there any other variables we're concerned with for books outside of length?

As it currently stands, the service is configured to handle single articles (timeouts are not big enough for books to render correctly). I wonder if this should be the next step only if the performance test of single articles passes.

Sounds good

Also we'll need a list of articles to test against. Do we have one?

I can put together a list. My thoughts would be:

  • 5-6 very long articles
  • 5-6 articles with a lot of images/infoboxes
  • 5-6 very short articles
  • the top 5-6 printed articles

OK, so it looks like the plan is to create two more tasks: (1) test the appearance of articles with and without mobile styles, (2) test how the server handles books (which should be done only after the service is ready for testing).

In addition to the article types you listed, I'd throw in:

  • articles in RTL languages
  • a minimalistic article that only has a couple of words (or no content) so that we can measure the base case
  • Look at current electron logs and identify the most frequently printed articles, and test them too.

OK, so it looks like the plan is to create two more tasks: (1) test the appearance of articles with and without mobile styles

I think we would want to test three things:

  • whether the service can simultaneously produce both desktop and mobile styles
  • whether chromium can support and create PDFs with mobile styles (not sure if we need to do this as we're currently serving the mobile styles on google chrome, but it's probably better to double-check)
  • the performance of the service when mobile styles are used (with the set of articles above)

Does that sound right? I can set up the task.

We also need to analyze access logs from electron service and try to replay the real traffic. Synthetic tests are not the best pick when it comes to performance testing.

@ovasileva yes, that sounds right.
@pmiazga, do you think that testing the service with real traffic is part of this task, or is it something we need to do after working on this task?

IMHO this is a part of this task, as we want to verify that the newly created service is good enough to handle the production traffic.

Also, would we be able to test books using the current concatenation code prior to deploying the service (but after this initial round of testing)?

I've left a comment on T181513. As for testing books, yes we need to find a way to either test the service using Extension:Collection or something else. But remember, the service's current goal is to replace Electron as is. We haven't thought about books in this context yet as I mentioned in T178278#3779400.

@ovasileva yes, I've updated the description. That said, we should stay focused and test the performance of the service not other things.

@ovasileva yes, I've updated the description. That said, we should stay focused and test the performance of the service not other things.

Thank you! And agreed - this would be just a quick check - we'll look further in separate tasks if necessary.

@pmiazga @Jdlrobson since the blocker (T181623) is blocked, I think we can move ahead with this task now.

I've updated the chromium-pdf.reading-web-staging.eqiad.wmflabs server with the latest chromium-render code.

Yup, I'll start working on it today.

Also note that pm2 is keeping the service alive.
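For reference, a few pm2 commands that are handy while testing; the entry-point file and process name below are assumptions, not necessarily what is configured on the VPS:

    pm2 start server.js --name chromium-render   # start and keep the service alive
    pm2 status                                    # uptime and restart count
    pm2 logs chromium-render                      # tail the service logs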

Testing round one.

Type: synthetic test, a set of 50 URLs, no pauses. I focused mostly on concurrency/response rates, not on server load, as the chromium renderer was a bit outdated. It was more of a "test round".

Time taken: 12 hours 30 minutes
Concurrency: 5-7 concurrent requests at once
Requests sent: 29738
Success rate: 100%
Error rate: 0%
Throughput: ~ 39 req/min
Data rate: 865.47KB/sec
Stored: 29GB of PDF files

Response times

Average response time: 9.815 sec
Median: 7.193 sec
Fastest response: 1.872 sec

99% of requests took less than 25.241 seconds to finish
95% of requests took less than 21.439 seconds to finish
90% of requests took less than 19.055 seconds to finish

Server load

CPU: 2 threads, 100%
Memory consumption - less than 3.5GB

Testing environment:

Two machines:

  • main device - laptop, i7 + 16GB of RAM, SSD drive (for storing data), using JMeter on a 50 Mbit connection
  • additional load - virtualized server on bare metal, 2x Xeon L5335, 8GB of RAM, ~150 Mbit connection

Other findings:

I used a set of 50 articles, which means the output should consist of roughly 50 distinct files rendered over and over. The system stored 22465 files and there are 1327 distinct file sizes. Some PDFs are broken and some are not rendered completely. I'll analyze the PDF files and submit a separate patch to verify/fix broken PDFs.
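A rough sketch of how the stored files could be checked (the directory name is an assumption); pdfinfo from poppler-utils fails on files it cannot parse, so this is only a first pass and will not catch PDFs that parse but are missing pages:

    # Number of distinct file sizes
    find output -name '*.pdf' -printf '%s\n' | sort -n | uniq | wc -l
    # Flag files that do not parse at all
    find output -name '*.pdf' -print0 | while IFS= read -r -d '' f; do
        pdfinfo "$f" > /dev/null 2>&1 || echo "broken: $f"
    done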

Thanks @pmiazga for this first round. One thing that could also be interesting to measure is how the response latency is influenced by heavy and/or long-running tasks. For example, what is the average response time for smaller/average-sized articles when long-running tasks are run in parallel versus not? Basically, the idea would be to measure whether continuously rendering something like Barack Obama influences the rendering times of other articles.

As for the PDF files themselves, the quality of the files should probably be inspected separately. When it comes to performance testing, accounting for storage is not relevant and can easily slow things down (especially if the system has to write 22k+ files).

@mobrovac I stored the files because I want to verify that they are rendered correctly. The transfer rate is less than 1MB/s and my SSD drive can write up to 250MB/s, so ~29GB over 12 hours is not that much; it should not slow down the load testing overall. I'll definitely have a couple of rounds where I do not store data (a typical performance/benchmark test), but for the regular load IMHO it doesn't hurt.
I like the idea of having a couple of long-running tasks in the background. I'll definitely do that.
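A minimal sketch of such a run (both URLs are placeholders): keep one long article rendering continuously in the background while timing a short article in the foreground, then compare the timings against the idle-state numbers:

    LONG_URL="http://chromium-pdf.wmflabs.org/RENDER_PATH_FOR_LONG_ARTICLE"     # placeholder
    SHORT_URL="http://chromium-pdf.wmflabs.org/RENDER_PATH_FOR_SHORT_ARTICLE"   # placeholder
    while true; do curl -s -o /dev/null "$LONG_URL"; done &
    BG=$!
    for i in $(seq 1 50); do
        curl -s -o /dev/null -w '%{time_total}\n' "$SHORT_URL"
    done > short_with_long_running.txt
    kill "$BG"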

@bd808 - FYI, we're testing the newly created chromium service (http://chromium-pdf.wmflabs.org); don't be alarmed if this service gets flooded with requests or constantly uses all resources (CPU & memory).

@Tbayer: @pmiazga and I had discussed replaying a day's worth of logs from the wmf_raw.webrequest table against the service (obviously, we'd only consider requests to /api/rest_v1/page/pdf/…). In order to do this, @pmiazga would have to copy the logs to his development machine.

My understanding is that that table contains potentially sensitive data and we should ask permission before making a copy of the logs. Is that correct? If so, who do we ask?

We also discussed only exporting the timestamp and the URL, which seems to minimize the amount of sensitive data that we export.

For the next 3-5 days I'll keep sending 5-10 concurrent requests all the time, plus from time to time I'll generate a PDF to verify everything is still correct. I also need to prepare a couple of different testing scenarios (like generating a set of only non-Latin-script languages).

So far the average response time is 10 seconds and the service is able to handle ~28 requests per minute. The service has not collapsed so far; the only issue is that some PDFs are not rendered correctly.

I no longer need to spend the whole day on this task, only around 2-3 hours daily to verify the results (I configured one server to constantly ask the chromium service to render some articles). I'll keep this task in Doing and start working on/reviewing other tasks.

@Tbayer: @pmiazga and I had discussed replaying a day's worth of logs from the wmf_raw.webrequest table against the service (obviously, we'd only consider requests to /api/rest_v1/page/pdf/…). In order to do this, @pmiazga would have to copy the logs to his development machine.

My understanding is that that table contains potentially sensitive data and we should ask permission before making a copy of the logs. Is that correct? If so, who do we ask?

We also discussed only exporting the timestamp and the URL, which seems to minimize the amount of sensitive data that we export.

Sorry, only seeing this now (@pmiazga and I had earlier discussed the use of webrequest data for this purpose on IRC, but not this particular question).

Yes, the webrequest data contains sensitive information, hence it is kept private and is only accessible to people under NDA who in addition have the required user rights. IANAL, but it seems common sense to avoid permanently storing the full logs on a personal development machine. Exporting only timestamp and URL sounds like a great idea.

By the way, any particular reason for using wmf_raw.webrequest instead of the more commonly used wmf.webrequest? (cf. documentation)

By the way, any particular reason for using wmf_raw.webrequest instead of the more commonly used wmf.webrequest? (cf. documentation)

A typo on my part. If we're exporting timestamps and URLs, then wmf.webrequest makes perfect sense.

TL;DR:

The service is performant and stable enough to be put into production. The most performant setting is to set the number of concurrent renders to {CPU_CORES_COUNT} * 2, which offers the best rendering-to-waiting-time ratio. For comparison, when idle the service can generate a short PDF in ~2 secs, while under high load the same request takes about 50% longer. I tested the service for ~2 weeks and transferred over ~200GB of data. Some tuning will be required before the production release (and during the first days after it): setting render_concurrency too high will result in lower performance and rejected tasks.
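For reference, the recommended value boils down to a one-liner (on the 2-core VPS this yields 4); where exactly render_concurrency is set depends on the service configuration and is not shown here:

    RENDER_CONCURRENCY=$(( $(nproc) * 2 ))
    echo "render_concurrency: ${RENDER_CONCURRENCY}"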

The service running on our VPS should handle ~100 000 requests daily (for average article sizes). Please note that we were testing the service against articles, not books. When generating only the top 5 printed articles, the service is able to handle ~58 000 requests daily. Our current traffic is ~60 000 requests daily.

Setup:

The service is available at chromium-pdf.wmflabs.org, a virtualized server. VPS allocation:

  • 2x Intel [ Family:6, Model:61, Stepping:2, Broadwell arch] @ 2.3GHz
  • 4GB of RAM
  • 20GB Drive

The Node service is running and managed via PM2, under the user chromium with limited privileges.

To test I used two tools:

  • siege
  • JMeter (both locally and a remote worker).

Most of the tests were executed on a remote server (Debian 9, no other jobs, ~150 Mbit connection); as an additional load source I used my work laptop (Debian 9, 50 Mbit connection).

Catches:

When the service is misconfigured:

  • if it allows more concurrent requests than the CPUs can handle, the service will start rejecting jobs (removed from / not allowed into the queue)
  • if the amount of memory is not enough to fit all concurrent requests, chromium instances start to fail, which may lead to an unexpected failure of the main node process and a service crash

Other observations:

  • the service could have better performance if given more RAM; the testing system was using somewhere between 75-90% of the available memory
  • when running lots of small jobs, the service renders most of the articles in less than 4-5 seconds and then a couple of the articles in over 9-10 seconds
  • for short articles, setting a higher concurrency usually gives higher throughput (for example, concurrency 20 gives 3.29 transactions per second but the average render time jumps to 5.8 seconds); we can handle more renders, but all users have to wait longer for their PDFs

Resource management:

The server uses ~3GB of RAM in total while load testing. When idle, memory consumption drops to 2.3GB. A rough estimate is ~100MB per chromium instance (one request).
The Chromium service does not perform many I/O operations: up to 400 blocks out while load testing.
While load testing, CPU consumption stays at 100% (load between 7-10; the load usually reflects the number of concurrent requests).
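For completeness, a sketch of how the figures above can be captured during a run (log file names are placeholders):

    vmstat 5                   > vmstat.log   &   # CPU, run queue, blocks in/out
    free -m -s 5               > memory.log   &   # memory consumption every 5s
    pidstat -u -r -C chrome 5  > chromium.log &   # per-process CPU/RSS for chromium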

Stress tests, a.k.a. benchmarks:

The service starts to reject jobs when the queue is full (which is the correct behavior). As an example, a benchmark with up to 100 concurrent requests and no waiting between requests gives ~3% availability; 97% of requests are properly rejected due to the full queue.
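One way to reproduce this overload case with siege (the file name is a placeholder): many concurrent clients, no think time, long articles only, so that most requests hit a full queue:

    siege -b -c 100 -t 10M -f long_articles.txt --log=stress_c100.log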

Load test:

The service was stress tested for approximately a week; the longest run was 72 hours at 100% CPU load (5 concurrent requests, a mix of long and short articles).
The service maintained 100% health and generated PDFs for >99% of requests (the failed ones were rejected by the queue logic, not by system failure, which is the correct behavior).
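One way to reproduce such a soak run with siege (the file name is a placeholder):

    siege -c 5 -t 72H -f mixed_long_and_short_articles.txt --log=soak_c5.log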

Performance tests:

High traffic, TOP 5 Printed articles

Service is able to handle up to ~58 000 requests daily

Availability:                 100.00 %
Response time:                  7.19 secs
Longest transaction:           16.54 secs
Shortest transaction:           3.27 secs
Transaction rate:               0.67 trans/sec
Throughput:                     0.71 MB/sec
Concurrency:                    4.83 jobs

High traffic, short articles

Service is able to handle up to ~245 000 requests daily

Availability:                 100.00 %
Response time:                  3.25 secs
Longest transaction:            8.45 secs
Shortest transaction:           1.67 secs
Transaction rate:               2.86 trans/sec
Throughput:                     0.49 MB/sec
Concurrency:                    9.28 jobs

Average-size articles with lots of math

Service is able to handle up to ~138 000 requests daily

Availability:                 100.00 %
Response time:                  5.98 secs
Longest transaction:           22.15 secs
Shortest transaction:           2.36 secs
Transaction rate:               1.60 trans/sec
Throughput:                     0.63 MB/sec
Concurrency:                    9.59 jobs

High traffic, very long articles (list taken from Special:LongPages + Obama), 7 concurrent requests

The service was not able to handle 10 concurrent requests; the queue started rejecting items.
The service is able to handle up to 7 concurrent requests while still offering high availability (~99%); at 8 concurrent requests it started rejecting jobs very often.
In total, Chromium-pdf can handle up to ~17 000 renders daily under heavy load.

Availability:                  98.87 %
Response time:                 33.57 secs
Longest transaction:           60.23 secs (the timeout)
Shortest transaction:           0.21 secs (the time it takes to return "job rejected")
Transaction rate:               0.20 trans/sec
Throughput:                     0.48 MB/sec
Concurrency:                    6.91 jobs

Very long articles (up to 4 jobs)

Handling 4 concurrent jobs, the service is able to render up to ~18 000 articles daily with a 100% success rate.

Availability:                 100.00 %
Response time:                 18.95 secs
Longest transaction:           35.85 secs
Shortest transaction:           8.79 secs
Transaction rate:               0.21 trans/sec
Throughput:                     0.49 MB/sec
Concurrency:                    3.94 jobs

Long+Short test

These tests were executed to find out how rendering long articles affects many short jobs.
During the tests, the service was handling up to 8 concurrent requests for short articles and 2 concurrent requests for long articles.
In total, the service is able to handle ~150 000 requests daily.

Short Articles
Response time:                  4.43 secs
Transaction rate:               1.71 trans/sec
Longest transaction:           14.70 secs
Shortest transaction:           2.15 secs
Throughput:                     0.29 MB/sec
Concurrency:                    7.56 jobs
Long Articles
Response time:                 20.32 secs
Longest transaction:           36.36 secs
Shortest transaction:           9.18  secs
Transaction rate:               0.10 trans/sec
Throughput:                     0.22 MB/sec
Concurrency:                    1.98 jobs

Light load tests, concurrency equal to the number of CPU cores

These tests were executed to find out how much time it takes to generate a PDF when the service is almost idle.

Very long articles (list taken from Special:LongPages + Obama)
Availability:                 100.00 %
Response time:                 11.65 secs
Longest transaction:           20.12 secs
Shortest transaction:           7.50 secs
Concurrency:                    1.95 jobs

Short articles
Availability:                 100.00 %
Response time:                  2.16 secs
Longest transaction:            3.26 secs
Shortest transaction:           1.28 secs
Concurrency:                    1.79 jobs

@Fjalapeno, @mobrovac: See the above (T178278#3852240) for the results of the performance test. @pmiazga has a little follow-on work to do but it shouldn't affect the recommendation. Would pinging someone from Ops be appropriate here? If so, then who?

Generate PDFs using the list of articles above. Verify that the PDF contents look good, i.e. the PDFs contain actual article text, and not some error message from RESTBase or somewhere else. Upload the resulting PDFs here.

AFAICT @pmiazga generated a lot of PDFs during his various performance test runs. Nevertheless, @pmiazga: Do you have a couple of PDFs lying around?

@phuedx - yes, around 50GB. I'll attach some

phuedx added a subscriber: bmansurov.
phuedx removed a subscriber: bmansurov.
phuedx updated the task description.
phuedx added a subscriber: bmansurov.

^ Not sure what's going on there… Sorry, @bmansurov!

Additionally, I replayed production traffic against the chromium-pdf renderer service. I retrieved all web requests to the PDF endpoint made on 2017-10-04.

Query

SELECT
    ts, uri_host, uri_path
FROM
    wmf.webrequest
WHERE
    webrequest_source = 'text' AND
    agent_type = 'user' AND
    year = 2017 AND
    month = 10 AND
    day = 4 AND
    uri_path LIKE '/api/rest_v1/page/pdf/%'

Query results and findings

The query returned 84445 requests for 57819 unique articles across 240 wikis.
The highest concurrency was 173 print requests sent within one second, but I think the problem lies elsewhere, as 172 of those requests were for the same article.
The most printed article was printed 450 times (on a single day), but again I think the problem lies elsewhere, as 172 of those requests were sent within the same second.
The second highest concurrency was 89, but again, all 89 requests were for the same article within the same second.
The average concurrency was 1.68 requests per second; the median was 1 request per second.

If I remove duplicate requests (the same article requested within the same second), the count drops to 76256 requests (8189 duplicates), the max concurrency drops to 18 requests per second, and the average drops to 1.54 requests per second.

Tests how-to

Using scripts, I transformed the query output into a CSV file containing the full URL to the chromium service for each request, including the wiki and the article. Then I created a JMeter test using an HTTPSampler and a CSVDataSet pointing to that CSV file.
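A rough sketch of that transformation (the chromium-render route is a placeholder that would need to be replaced with the service's actual path; the query output is assumed to be tab-separated with the columns ts, uri_host, uri_path):

    awk -F'\t' '{
        title = $3
        sub("^/api/rest_v1/page/pdf/", "", title)   # keep only the article title
        print "http://chromium-pdf.wmflabs.org/" $2 "/RENDER_PATH/" title
    }' webrequest_export.tsv > urls.csv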

Test results

I ran the JMeter test twice, both times at concurrency 5, which I found to be the most efficient setting during performance testing.

  • First run: the PDF service rendered all articles in 15:20:10 (15 hours, 20 minutes); 2 articles timed out
  • Second run: the PDF service rendered all articles in 14:46:54 (14 hours, 46 minutes); 1 article timed out

Results

It looks like the current instance of Chromium-PDF is able to handle the production traffic. I still have to take a closer look and verify why some articles are requested so often (8 articles got more than 100 print requests within a ~5 second window).
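A quick way to surface those articles from the export (the column layout is assumed to be ts, uri_host, uri_path, tab-separated):

    cut -f2,3 webrequest_export.tsv | sort | uniq -c | sort -rn | head -20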

I should also mention that I did a "live data" test with a set of scripts (a sketch of the per-second runner follows the list):

  • the first script took the list of all requests for a single day and split it into a set of smaller lists, one per second of the day (file name format HHMMSS.csv)
  • a queryChromium script that read the system time, loaded the matching HHMMSS.csv file, and sent simultaneous curl commands in the background (curl $URL > /dev/null &) to retrieve all URLs
  • a job that ran for one day, invoking the queryChromium script every second
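A rough reconstruction of the per-second runner; it assumes each requests/HHMMSS.csv contains one URL per line, which matches the description above but is not taken from the actual script:

    #!/bin/bash
    # queryChromium.sh: fire all requests recorded for the current second.
    now=$(date +%H%M%S)
    [ -f "requests/${now}.csv" ] || exit 0
    while IFS= read -r url; do
        curl -s "$url" > /dev/null &
    done < "requests/${now}.csv"

The one-day job then boils down to invoking this script once per second, e.g. for i in $(seq 1 86400); do ./queryChromium.sh & sleep 1; done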

With that set of tools I was able to emulate the production traffic with very similar timing (1-second granularity). The Chromium-PDF service running on the VPS handled that test well. There were many rejected jobs due to the strange data in the webrequest table (like 70 requests for the same resource, at the same second, with the same UA and IP but different sequence_id). ~10 jobs failed due to render timeouts, but again, that was related to the strange webrequest data.