
Benchmark the new page summary API
Closed, Resolved · Public

Description

We should benchmark the new page summary API to ensure that we're not introducing a severe performance regression in summary creation.

Due to ongoing code churn over various issues it sounds like we're pushing back the switchover until the week after Dev Summit/All Hands. This buys us some time to get a sense of these numbers after the code firms up.

Event Timeline

Mholloway created this task.
Mholloway added a subscriber: mobrovac.

We have a few scripts which go through the top pages of a few wikis. I think I modeled the measure-payloads.js script to use the Node Performance API but ended up mainly checking the payload sizes.

What I'm thinking is that I'll spin up a medium or large Cloud VPS instance with local RESTBase and MCS installations and benchmark for a test sample before and after using ab and parallel along the lines described here: https://www.simonholywell.com/post/2015/06/parallel-benchmark-many-urls-with-apachebench/
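
Roughly, I'm picturing something like the following (just a sketch; URLS.txt, the results directory, and the job/request counts are placeholders for whatever the final setup ends up using):

mkdir -p results
# run ab against each URL in the sample list, a few jobs at a time,
# writing one ab report per URL into results/
cat URLS.txt | parallel --jobs 4 "ab -n 100 -c 10 {} > results/{#}.txt"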

Does that seem like a reasonable plan?

A few questions:

  • Should we benchmark both endpoints with page HTML already cached in RB, or from a totally clean-slate DB? (Pre-cached page HTML would help isolate what we're interested in, I think.)
  • What sample size is sufficient? 100 pages? 1000?
  • What pages should be included in the sample?
    • Popular pages? These are likely to be much larger/more complex than average, but also probably more representative of the real workload
    • A random selection of pages?

Any other testing ideas (involving the existing page request scripts, or otherwise) that I should consider?

+ Performance team, in the hope of getting some better ideas.

If the ab/parallel approach doesn't work out, it should be easy to mangle one of the scripts we already have to add performance measurements. The difference is that our current scripts don't run things in parallel, and it might be harder to get statistical data.

The first big question is what/how we want to measure:
a) Measure the median time to generate the summary
b) Measure how many requests/s we can sustain

I haven't yet seen how the method mentioned in the article accomplishes (b), nor do I think the scripts we have are meant for that. So I think (a) would be simpler; I hope that's good enough.

I would start testing with a small sample size to verify your setup, then go bigger.

  • Should we benchmark both endpoints with page HTML already cached in RB, or from a totally clean-slate DB? (Pre-cached page HTML would help isolate what we're interested in, I think.)

Yes, we want to isolate Parsoid HTML generation from this since we're not testing Parsoid here. So pre-cached Parsoid HTML, yes, but let's avoid using the cache and RB storage for the summary endpoints. Ideally, then, you'd pre-warm the Parsoid HTML cache locally as part of the setup.
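
As a sketch of what I mean by pre-warming (assuming a local RB instance on port 7231, as in the URLs below, and a titles.txt with one title per line), it could be as simple as requesting the Parsoid HTML for each sample title once so RB stores it:

# fetch the Parsoid HTML for every sample title once; titles may need URL-encoding
while read -r title; do
  curl -s -o /dev/null "http://0.0.0.0:7231/en.wikipedia.org/v1/page/html/${title}"
done < titles.txt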

We could potentially also compare the MW API call for TextExtracts and the MCS direct call. In that case you'd want to make sure the sample page list doesn't include any redirects, since handling those is RB's concern.
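
For the TextExtracts side of that comparison, the direct MW API query would look roughly like this (the exact extract parameters the current RB implementation uses may differ, so treat this as an approximation):

curl -s 'https://en.wikipedia.org/w/api.php?action=query&format=json&prop=extracts&exintro=1&explaintext=1&titles=Darth_Vader'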

  • What pages should be included in the sample?

We have several lists of top 1000 pages in the private/top-pages folder of our repo. I'd be ok with testing a couple of the languages available there.

To build the URLS.txt file the article mentions, you could use a command similar to cat top-pages.en.json | jq -r '.items[] | [.title] | @tsv' to get one title per line. Or consider something like this for full URLs:

cat top-pages.en.json| jq -r '.items[] | .url = "http://0.0.0.0:7231/en.wikipedia.org/v1/page/summary/\(.title)" | .url'
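
One caveat from me (not from the article): titles containing spaces or non-ASCII characters may need URL-encoding before ab will accept them, which jq's @uri filter can take care of, e.g.:

cat top-pages.en.json | jq -r '.items[] | "http://0.0.0.0:7231/en.wikipedia.org/v1/page/summary/" + (.title|@uri)'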

The revisions in the JSON files are old by now, so if you also want to specify revisions in the URLs then I would recommend regenerating the top-pages files you use; there's a script that generated them. But as long as you run the tests on the same day, revisions shouldn't matter too much anyway.

Any other testing ideas (involving the existing page request scripts, or otherwise) that I should consider?

I'm mainly interested in comparing some median values (p50, p95?) of the old implementation going to the TextExtracts MW API and the new one going to MCS. We probably need a large enough number of test runs to get meaningful data, but not so many that they cause issues. As I said above, start small to get a rough idea (we might have some more tweaks to our implementation coming), then use bigger numbers.
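
Once you have a file of per-request latencies (one number per line), pulling out the percentiles can be as simple as sort plus awk; a rough sketch (times.txt is a placeholder name, and this assumes a reasonably large sample):

sort -n times.txt | awk '{ a[NR] = $1 } END { print "p50:", a[int(NR*0.50)]; print "p95:", a[int(NR*0.95)] }'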

@Mholloway Not much to add from the perspective of the perf team. Your approach seems like it makes a lot of sense. @bearND got most of the additional comments that we would have made.

Mentioned in SAL (#wikimedia-cloud) [2018-02-01T15:13:31Z] <mdholloway> launched new instance page-summary-performance (T184751)

@bearND Yeah, I guess ab is more focused on load testing, which is a bit different than what we're after here.

I guess I could update or adapt measure-payloads.js to track response times, as you suggested, and that would be a better way to go. I'm a little confused about where the Node Performance API comes in, though. Isn't that more for benchmarking Node processes internally? (Actually, that probably would be a useful thing to do, just not helpful for comparing against the current RB/TextExtracts summary generation, unless I'm missing something.)

If what you're looking to measure is concurrent performance -- that is, performance of a bunch of requests executing simultaneously -- ab is a perfectly reasonable choice for a tool.

Since the tests run in a different environment, I think that to get some meaningful reports we should compare whatever we're measuring against MCS mobile-sections requests. That way we can better gauge how to roll this out (see the current plan in T179875) and whether we foresee any problems.

Background info: according to the MCS dashboard in Grafana we usually have fewer than 75 req/s. During the last big deploy event in mid-December it briefly spiked to 125 req/s. @Pchelolo tells me that SCB was fine with that.

@Imarlier Any guidance on what we should be measuring? Is measuring latency enough, or should we do actual throughput testing?

@Mholloway I don't believe that using the Node Performance API is required. I played with it a bit a while ago when I wrote the script I mentioned earlier, but it is still marked experimental and requires at least Node 8.5 for the minimal API. So, don't worry about that for this round. Whatever scripts you write for this, I would like to see them end up in Git so we can reuse them for future endpoints.

@Imarlier I should probably go into a little more background on what's going on here. Apologies if you know most or all of this already.

RESTBase is an API proxy, backed by a DB cache, that serves REST API endpoints including the page summary endpoint (/api/rest_v1/page/summary/{title}). Responses are stored in Cassandra for faster service; in production, many endpoints (including the page summary endpoint) have responses pre-generated via some script for that purpose; otherwise, in a dev environment, the response for a given title(/revision/timeuuid) is stored after the first request. (In production, updated responses are also subsequently regenerated and stored upon change events such as page edits.)

Page summary endpoint responses are currently composed in RESTBase itself from the response to an underlying MediaWiki API call. Our proposed change updates the response composition process to call on the mobile content service (mobileapps) to get the underlying info from the MediaWiki API and then do a substantial bit more processing before returning a response to RESTBase for storage. The task here is to ensure that the change doesn't introduce a substantial performance regression. But it's somewhat tricky to assess because it's only the first request for a given resource that counts; after that, RESTBase will simply serve the stored response, which we can assume will be fast. The scenario we're worried about is the performance for responding to a request for which a current response hasn't already been stored. (@mobrovac or @Pchelolo may have additional color and/or corrections here, but I think the Services team is at an offsite this week.)

So what I think we really want to measure are the response times, before and after this change, in a scenario in which RESTBase does not yet have any stored page summary responses, because the summaries haven't been requested yet. For a given test run, I think we actually want to request each title in a large sample exactly once, not many times concurrently, because we're not worried about the performance of serving stored responses, only the performance of generating new ones.
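
Concretely, I'm imagining a run shaped something like this against a cold RB (URLS.txt and times.txt are placeholder names): hit every summary URL in the sample exactly once and record the total response time for each.

# one request per URL, recording total response time in seconds
while read -r url; do
  curl -s -o /dev/null -w '%{time_total}\n' "$url"
done < URLS.txt > times.txt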

Here are your numbers, @Mholloway. Measured from within the production environment, from xenon.eqiad.wmnet:

RESTBase serving from storage:

ab -c 10 -n 5000 http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/summary/Darth_Vader

Concurrency Level:      10
Time taken for tests:   4.833 seconds
Complete requests:      5000
Failed requests:        0
Total transferred:      17510000 bytes
HTML transferred:       12070000 bytes
Requests per second:    1034.53 [#/sec] (mean)
Time per request:       9.666 [ms] (mean)
Time per request:       0.967 [ms] (mean, across all concurrent requests)
Transfer rate:          3538.02 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   1.8      0      17
Processing:     4    9   4.9      8     117
Waiting:        4    9   4.9      8     115
Total:          4   10   5.8      8     117

Percentage of the requests served within a certain time (ms)
  50%      8
  66%     10
  75%     10
  80%     11
  90%     14
  95%     18
  98%     26
  99%     31
 100%    117 (longest request)

Current implementation in RESTBase with no storage:

ab -c 10 -n 1000 -H 'Cache-Control: no-cache' http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/summary/Darth_Vader

Concurrency Level:      10
Time taken for tests:   7.377 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      3502000 bytes
HTML transferred:       2414000 bytes
Requests per second:    135.55 [#/sec] (mean)
Time per request:       73.772 [ms] (mean)
Time per request:       7.377 [ms] (mean, across all concurrent requests)
Transfer rate:          463.58 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2  31.9      0    1009
Processing:    46   72  42.1     67    1097
Waiting:       46   72  42.1     67    1097
Total:         46   74  52.5     68    1101

Percentage of the requests served within a certain time (ms)
  50%     68
  66%     72
  75%     75
  80%     77
  90%     85
  95%     93
  98%    127
  99%    270
 100%   1101 (longest request)

MCS implementation with no storage:

ab -c 10 -n 1000 http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/summary/Darth_Vader

Concurrency Level:      10
Time taken for tests:   40.715 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      3012000 bytes
HTML transferred:       2127000 bytes
Requests per second:    24.56 [#/sec] (mean)
Time per request:       407.150 [ms] (mean)
Time per request:       40.715 [ms] (mean, across all concurrent requests)
Transfer rate:          72.24 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   3.7      0      30
Processing:   272  403  88.1    392    1655
Waiting:      272  403  88.1    392    1655
Total:        272  405  88.9    393    1668

Percentage of the requests served within a certain time (ms)
  50%    393
  66%    414
  75%    430
  80%    442
  90%    473
  95%    511
  98%    574
  99%    643
 100%   1668 (longest request)

So it seems that MCS is roughly five times slower than the previous implementation (p50 of ~393 ms vs ~68 ms), but that doesn't really matter because everything will be served to clients from storage, so the ~10 ms latency will remain.

@Pchelolo very nice. Can you add the equivalent runs for mobile-sections so we can compare the two endpoints? I'm mainly looking for the MCS implementation with no storage going to http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/mobile-sections/Darth_Vader, to see how that compares to the new summary one. I basically want to know the relationship between the latencies of summary and mobile-sections.

@bearND sure!

ab -c 10 -n 1000 http://mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/mobile-sections/Darth_Vader

Concurrency Level:      10
Time taken for tests:   64.927 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      207527000 bytes
HTML transferred:       206631000 bytes
Requests per second:    15.40 [#/sec] (mean)
Time per request:       649.272 [ms] (mean)
Time per request:       64.927 [ms] (mean, across all concurrent requests)
Transfer rate:          3121.39 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2  32.0      0    1011
Processing:   473  644  70.6    637     958
Waiting:      470  638  69.9    632     955
Total:        473  646  77.1    639    1609

and the stored version:

ab -c 10 -n 1000 http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/mobile-sections/Darth_Vader

Concurrency Level:      10
Time taken for tests:   4.158 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      207753000 bytes
HTML transferred:       206631000 bytes
Requests per second:    240.50 [#/sec] (mean)
Time per request:       41.580 [ms] (mean)
Time per request:       4.158 [ms] (mean, across all concurrent requests)
Transfer rate:          48793.03 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   2.2      0      21
Processing:    21   40  15.6     37     191
Waiting:       18   33  12.7     31     188
Total:         21   41  16.6     38     195

Percentage of the requests served within a certain time (ms)
  50%     38
  66%     42
  75%     46
  80%     48
  90%     58
  95%     69
  98%     95
  99%    114
 100%    195 (longest request)

You have to take into account that the payload size is much bigger for mobile-sections, but overall it seems we're fine with the new summary performance.

OK, just comparing the "MCS no storage" runs, I see the median (p50) of summary is about 62% of the mobile-sections requests (393 ms vs 637 ms). The max (p100) is actually 103% (1668 ms vs 1609 ms), so slightly higher; I think the p100 case is probably an outlier. Unfortunately the percentile distribution for that run was not included; it would be good to look at some of the p9x values, too. So, I think it should be less than a doubling of the load.

Thanks, @Pchelolo, that's awesome. I'm glad the new summary performance numbers are still in acceptable range (even for a big article), despite being quite a bit slower than the current implementation.

@Pchelolo Anything else needed from us, or can we close this out?

Pchelolo edited projects, added Services (done); removed Services (watching).

@Mholloway I believe we can close this one.

