
restbase 40% slower than old action api
Closed, Invalid · Public

Description

This is taking into account that:

  • Mediawiki is currently serving almost all of the uncached text/api requests
  • on 3-5 year old servers
  • and without having all pages precached
  • The main purpose of restbase is serving "low latency" simple requests, especially for parsoid and flow

Full results:

https://docs.google.com/spreadsheets/d/1Gtw_i2ZzzuGreElNNRxyxExv4G25sEkonLle7Kj_8ZU/edit?usp=sharing

run:

run1:
for i in `seq 1 100`; do line=$(( ( `od -A n -t d -N 4 /dev/urandom` + 2147483648 ) % 12105173)); title=`tail -n+$line /tmp/.fr-nmoWUj/enwiki-20151201-all-titles-in-ns0 | head -n1`; echo $title; TIME1=`date +%s.%N`; curl -s "https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=$title&format=json" | wc -c; TIME2=`date +%s.%N`; echo `echo "$TIME2 - $TIME1" | bc -l`; TIME1=`date +%s.%N`; curl -s "https://rest.wikimedia.org/en.wikipedia.org/v1/page/html/$title" | wc -c; TIME2=`date +%s.%N`; echo `echo "$TIME2 - $TIME1" | bc -l`; sleep 1; done > ~/workbook/comparison1.txt

run2:
jynus@sangai:~/mysql-server$ for i in `seq 1 100`; do line=$(( ( `od -A n -t d -N 4 /dev/urandom` + 2147483648 ) % 12105173)); title=`tail -n+$line /tmp/.fr-nmoWUj/enwiki-20151201-all-titles-in-ns0 | head -n1`; echo $title; TIME1=`date +%s.%N`; curl -s "https://rest.wikimedia.org/en.wikipedia.org/v1/page/html/$title" | wc -c; TIME2=`date +%s.%N`; echo `echo "$TIME2 - $TIME1" | bc -l`; TIME1=`date +%s.%N`; curl -s "https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=$title&format=json" | wc -c; TIME2=`date +%s.%N`; echo `echo "$TIME2 - $TIME1" | bc -l`; sleep 1; done > ~/workbook/comparison2.txt
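For readability, the same measurement loop can be written out as a multi-line script. This is only a sketch of the one-liners above, assuming the same title dump path and line count:

#!/bin/bash
# Sketch of the benchmark loop from run1, reformatted for readability.
# Assumes the same title dump and line count as the one-liners above.
TITLES=/tmp/.fr-nmoWUj/enwiki-20151201-all-titles-in-ns0
LINES=12105173

for i in $(seq 1 100); do
    # Pick a pseudo-random line number and read the corresponding title.
    line=$(( ( $(od -A n -t d -N 4 /dev/urandom) + 2147483648 ) % LINES ))
    title=$(tail -n +"$line" "$TITLES" | head -n1)
    echo "$title"

    # Time an uncached action API parse request (body size, then seconds).
    t1=$(date +%s.%N)
    curl -s "https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=$title&format=json" | wc -c
    t2=$(date +%s.%N)
    echo "$t2 - $t1" | bc -l

    # Time the corresponding RESTBase HTML request.
    t1=$(date +%s.%N)
    curl -s "https://rest.wikimedia.org/en.wikipedia.org/v1/page/html/$title" | wc -c
    t2=$(date +%s.%N)
    echo "$t2 - $t1" | bc -l

    sleep 1
done > ~/workbook/comparison1.txt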

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo added a subscriber: jcrespo.

I made sure Varnish wasn't serving cached responses for either service by checking that the HTTP headers showed a miss/pass.
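A quick way to verify this (a sketch; the X-Cache header format matches the example quoted later in this thread):

# Dump the response headers of a GET and look at the Varnish cache status.
curl -s -o /dev/null -D - "https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Main_Page&format=json" | grep -i x-cache
curl -s -o /dev/null -D - "https://rest.wikimedia.org/en.wikipedia.org/v1/page/html/Main_Page" | grep -i x-cache
# "miss" or "pass" entries indicate the request was served by the backend, not the cache.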

Very nice comparison, @jcrespo, thank you!

I have yet to confirm this, but I suspect this is the downside of our current hack for T120409: RESTBase should honor wiki-wide deletion/suppression of users. For each request, while retrieving the title, we issue a request to the MW API to check for permissions. However, once we have page encapsulation in RESTBase (cf T106455: Add a first-class representation of pages (title-associated information) in restbase), we should be able to eliminate this call and rely solely on updates.

I ran the following command on tin:

for i in `seq 1 100`; do line=$(( ( `od -A n -t d -N 4 /dev/urandom` + 2147483648 ) % 12105173)); title=`tail -n+$line enwiki-20160113-all-titles-in-ns0 | head -n1`; echo -n "$title "; echo -n ''; TIME1=`date +%s.%N`; echo -n "`curl -s "https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=$title&format=json" | wc -c` "; TIME2=`date +%s.%N`; echo -n "`echo "$TIME2 - $TIME1" | bc -l` "; TIME1=`date +%s.%N`; echo -n "`curl -s "https://en.wikipedia.org/api/rest_v1/page/html/$title" | wc -c` "; TIME2=`date +%s.%N`; echo `echo "$TIME2 - $TIME1" | bc -l`; sleep 1; done > comparison1.txt

Note that both are using en.wikipedia.org as the domain, and the test is run from the same DC to avoid bottlenecking on the network. The other changes are just to print each measurement as a space-separated line, for easier import into a spreadsheet.
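For reference, a minimal awk sketch (assuming the column layout above: title, action API bytes and seconds, then REST API bytes and seconds) for computing mean latencies straight from the output file:

# Columns: title api_bytes api_seconds rest_bytes rest_seconds
awk '{ api += $3; rest += $5; n++ }
     END { printf "action API mean: %.3fs\nREST API mean:   %.3fs\n", api/n, rest/n }' comparison1.txt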

Results at https://docs.google.com/spreadsheets/d/1Hd9Ln6ncRNqTNRtj-AXUZyAK214luYIUi3eQr1ZjRtQ/edit#gid=1257585686

Summary: Mean latency for the PHP API is about 0.12s, while RB's is about 0.062s, so roughly half of the PHP API's. This fits pretty well with the data we are getting from production: https://grafana-admin.wikimedia.org/dashboard/db/api-summary?panelId=2&fullscreen

Please run it from outside the cluster; I get reproducible results from 2 different continents. 99.999% of our users live outside of the internal cluster.

For each request, while retrieving the title, we issue a request to the MW API to check for permissions.

Mean latency for the PHP API is about 0.12s, while RB's is about 0.062s

Aren't these two statements incompatible?

@jcrespo, you are comparing apples & oranges:

  1. You connect to en.wikipedia.org in one case, but use the deprecated & likely not geo-located rest.wikimedia.org in the other case. This matters, especially for TLS connection setup (see the curl timing sketch after this list).
  2. Parsoid HTML is larger, as it contains more semantic information (data-mw). Since we are interested in API latency and not network transfer times, please run your tests from a place that's close to the API.
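To separate TLS connection setup from server processing and transfer time (point 1 above), curl's --write-out timings can be used; a sketch for a single request:

# Break one request down into DNS, TCP, TLS, time-to-first-byte and total time.
curl -s -o /dev/null \
     -w 'dns=%{time_namelookup} tcp=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
     "https://en.wikipedia.org/api/rest_v1/page/html/Main_Page"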

@mobrovac, the user suppression thing is only about external requests for revision metadata (which contains usernames). It does not affect any other end points, nor internal requests for stored revision information.

Just for fun, I also re-ran the tests directly against the APIs themselves, bypassing the NGINX + Varnish layers altogether.

Commandline:

for i in `seq 1 100`; do line=$(( ( `od -A n -t d -N 4 /dev/urandom` + 2147483648 ) % 12105173)); title=`tail -n+$line enwiki-20160113-all-titles-in-ns0 | head -n1`; echo -n "$title "; echo -n ''; TIME1=`date +%s.%N`; echo -n "`curl -H 'Host: en.wikipedia.org' -s "http://api.svc.eqiad.wmnet/w/api.php?action=parse&prop=text&page=$title&format=json" | wc -c` "; TIME2=`date +%s.%N`; echo -n "`echo "$TIME2 - $TIME1" | bc -l` "; TIME1=`date +%s.%N`; echo -n "`curl -s "http://restbase.svc.eqiad.wmnet:7231/en.wikipedia.org/v1/page/html/$title" | wc -c` "; TIME2=`date +%s.%N`; echo `echo "$TIME2 - $TIME1" | bc -l`; sleep 1; done > comparison1.txt

Results: "direct API access" sheet in https://docs.google.com/spreadsheets/d/1Hd9Ln6ncRNqTNRtj-AXUZyAK214luYIUi3eQr1ZjRtQ/edit#gid=530278332

Mean latencies:

  • action API: 0.109s
  • REST API: 0.041s
  1. I will repeat the tests using en.wikipedia.org and will post the results here.
  2. Even if that were the case, you can see in my results that Parsoid HTML is only 23.71% larger on average (which is already a bug in itself), while the latency is 40.06% higher. You also contradict yourself by saying that the TLS connection setup is very significant while at the same time attributing the difference to larger transfer times. I would be glad to be proven wrong, but please use data, as I have started doing.

please run your tests from a place that's close to the API.

Again, will you say the same to all our users? For your performance problems, move to Virginia?

Can you share the code for graphite's https://grafana.wikimedia.org/dashboard/db/api-summary?from=1454540400000&to=1454626799999&var-percentile=p95&var-dc=eqiad ?

Can you repeat them from your home/office, using the https protocol, pretty please?

Can you repeat them from your home/office, using the https protocol, pretty please?

Sure, now available as 'cable-wifi-sf' in https://docs.google.com/spreadsheets/d/1Hd9Ln6ncRNqTNRtj-AXUZyAK214luYIUi3eQr1ZjRtQ/edit#gid=1777800445

Commandline:

for i in `seq 1 100`; do line=$(( ( `od -A n -t d -N 4 /dev/urandom` + 2147483648 ) % 12105173)); title=`tail -n+$line enwiki-20160113-all-titles-in-ns0 | head -n1`; echo -n "$title "; echo -n ''; TIME1=`date +%s.%N`; echo -n "`curl --compressed -H 'Accept-Encoding: gzip' -s "https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=$title&format=json" | wc -c` "; TIME2=`date +%s.%N`; echo -n "`echo "$TIME2 - $TIME1" | bc -l` "; TIME1=`date +%s.%N`; echo -n "`curl --compressed -H 'Accept-Encoding: gzip' -s "https://en.wikipedia.org/api/rest_v1/page/html/$title" | wc -c` "; TIME2=`date +%s.%N`; echo `echo "$TIME2 - $TIME1" | bc -l`; sleep 1; done > /tmp/comparison1-cable-wifi-SF.txt

I added Accept-Encoding: gzip to more accurately emulate a browser, and also to reduce the impact of the HTML size difference.
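To double-check that both endpoints actually return gzip-compressed bodies with that header, the Content-Encoding response header can be inspected (a sketch):

# Inspect the Content-Encoding header returned by each endpoint.
curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' "https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Main_Page&format=json" | grep -i content-encoding
curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' "https://en.wikipedia.org/api/rest_v1/page/html/Main_Page" | grep -i content-encoding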

Mean latencies:

  • Action API 0.407s
  • REST API 0.348s

This is (again) despite different HTML sizes.

@jcrespo, I'm going to wait for your confirmation before closing this.

GWicke set Security to None.

Parsoid HTML is only 23.71% larger on average (which is already a bug in itself)

I will refrain from commenting on the rest of the discussion, but I wanted to clarify something about this. To be deliberately provocative, I am going to say that this is actually a feature, not a bug. That size increase is because of the data-mw attribute. That JSON blob communicates information about templates, extensions, and images, among other things ( https://www.mediawiki.org/wiki/Parsoid/MediaWiki_DOM_spec ). VE, Flow, CX, Google, and mobile content services use that information to varying degrees.

But yes, for normal read views (which aren't yet served from Parsoid HTML, though that is a goal of the parsing team), that data-mw is bloat. In order to move ahead with that goal, we plan to move data-mw from an inlined HTML attribute into a separate JSON blob that clients can fetch on demand (see the Parsing team entry on https://www.mediawiki.org/wiki/Wikimedia_Engineering/2015-16_Q3_Goals#Editing ).

The reason all the HTML is pre-cached is that the information Parsoid gathers and communicates via data-mw and other attributes and markup, as well as what it records in data-parsoid (which it uses for minimizing dirty diffs), is computationally quite expensive to generate ( https://grafana.wikimedia.org/dashboard/db/parsoid-timing-wt2html ). There is another reason why it is pre-cached: there would be no VE (or CX) if Parsoid couldn't convert edited HTML to wikitext without introducing dirty diffs, and that conversion process requires access to the HTML of the previous revision (before the VE edit).

By requesting them compressed, you are hitting Varnish for RESTBase but not for the Action API.

Thanks for finally confirming the problem: Varnish-cached RESTBase has about the same performance as the uncached, 1000-times-more-loaded Action API.

By requesting them compressed, you are hitting Varnish for RESTBase but not for the Action API.

No, both requests are uncached in Varnish at the moment. You can easily verify this by looking at the headers:

curl -v --compressed -H 'Accept-Encoding: gzip' -s "https://en.wikipedia.org/api/rest_v1/page/html/Main_Page" 2>&1 | grep X-Cache
< X-Cache: cp1068 miss+chfp(0), cp4018 pass+chfp(0), cp4016 frontend pass+chfp(0)

We plan to change this for the RESTBase HTML end points once we have solid purging set up.

GWicke claimed this task.

I'm going to wait for your confirmation before closing this.

I want to hear from someone from Performance-Team first. If you do not want to own this, the last thing I want is to bother you more; please unsubscribe and remove the appropriate projects. But *I would love it if you were interested in hearing that too*.

I also asked for the code used for graphite, please, if you have the time (a pointer to a repo is enough).

jcrespo raised the priority of this task from Low to Needs Triage.
jcrespo removed projects: RESTBase, Services.
jcrespo removed subscribers: ssastry, GWicke, mobrovac and 2 others.

I also asked for the code used for graphite, please, if you have the time (a pointer to a repo is enough).

I didn't find the varnish backend metric reporter code on github, but am pretty sure that @ori wrote it. There are latency and request rates for each of the varnish backends in graphite.

Tell me where it is running, and I will copy it myself, no problem with that! :-)

Tell me where it is running, and I will copy it myself, no problem with that! :-)

Not sure if it's running on the varnish boxes themselves, or looking at aggregated logs elsewhere. If you find it, let us know. We could use status code metrics for T123854.

I also asked for the code used for graphite, please, if you have the time (a pointer to a repo is enough).

I didn't find the varnish backend metric reporter code on github, but am pretty sure that @ori wrote it. There are latency and request rates for each of the varnish backends in graphite.

It's https://github.com/wikimedia/operations-puppet/blob/c62c102e8/modules/varnish/files/varnishstatsd

I see this is still 'closed invalid'; does everyone agree that this is correct?

@ArielGlenn, an independent assessment would probably help; if you have the time outside your main responsibilities to run the proposed script, that would be appreciated.

@jcrespo, what are your results from running the en.wikipedia.org comparison in https://phabricator.wikimedia.org/T125840#1999533 ?

Happy to reopen if you see different results.

@GWicke I have updated the results on the spreadsheet by using the en.wikipedia.org domain. Either SSL was the culprit or the code/architecture has improved since I ran the tests.

Results from Europe make RESTBase 7-14% faster, and from the US 15-21% faster, compared to api.php (see the sketch after this list). There are several things I want to note, however:

  • The same-domain entry point (as opposed to the deprecated one) should be better documented
  • It is essential to implement local serving of data from each datacenter, both for RESTBase (Varnish) and MediaWiki (active-active), as that can make a 10-15% difference for end clients
  • Are you really happy with only a 10-20% difference, given that, according to that graph, RESTBase is receiving 20 times fewer requests/s?
  • Doing serialized requests (1 per second) noticeably elevates the overall latency of RESTBase, both the median and the p95 (see the end of that graph, covering the period while I was running the tests), reaching 0.8 seconds for GETs and 20 seconds for POSTs. There is a scalability issue there.
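As a reference for the percentages above, a minimal awk sketch (assuming the same space-separated column layout and output file name as the earlier runs) for the mean relative difference:

# Mean latency reduction of the REST API relative to the action API.
awk '{ api += $3; rest += $5; n++ }
     END { printf "REST API is %.1f%% faster on average\n", 100 * (api - rest) / api }' comparison1.txt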