
Stress-test mediawiki application servers at codfw (especially to figure out the db weights configuration) and basic buffer warming
Closed, Resolved · Public

Description

  • Create a script to curl random valid mediawiki URLs (a rough sketch follows below)
  • Configure db-codfw.php
  • Execute it from some time before the actual switchover
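
For illustration only, a minimal sketch of the first bullet, assuming the Python `requests` library; the wiki hostname and request count are placeholders, and this is not the script that was actually written:

```
#!/usr/bin/env python3
"""Sketch: hit random valid MediaWiki URLs (wiki and request count are placeholders)."""
import requests

WIKI = "https://test2.wikipedia.org"   # placeholder wiki to exercise
N_REQUESTS = 1000                      # placeholder request count

session = requests.Session()
for i in range(N_REQUESTS):
    # Special:Random redirects to an existing page, so every followed URL is valid.
    r = session.get(WIKI + "/wiki/Special:Random", timeout=30)
    if r.status_code != 200:
        print(i, r.status_code, r.url)
```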

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Change 267659 had a related patch set uploaded (by Jcrespo):
Delete eqiad masters from codfw configuration and add db weights

https://gerrit.wikimedia.org/r/267659

Change 267659 merged by jenkins-bot:
Prepare db-codfw.php for a live deployment

https://gerrit.wikimedia.org/r/267659

The first phase will be a mass-request of Special:BlankPage on testwiki or test2wiki.
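
For reference, a minimal sketch of requesting Special:BlankPage on a single codfw backend, assuming the Python `requests` library; the backend hostname, wiki and batch size are placeholders, not the actual tooling used:

```
#!/usr/bin/env python3
"""Sketch: request Special:BlankPage on one codfw backend (names are placeholders)."""
import time
import requests

APPSERVER = "http://mw2099.codfw.wmnet"   # hypothetical codfw app server
WIKI_HOST = "test2.wikipedia.org"         # wiki selected via the Host header

for i in range(100):                      # small batch; the real runs were much larger
    start = time.monotonic()
    r = requests.get(APPSERVER + "/wiki/Special:BlankPage",
                     headers={"Host": WIKI_HOST}, timeout=30)
    print(f"{i:3d} HTTP {r.status_code} in {(time.monotonic() - start) * 1000:.0f} ms")
```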

I want to do the following before running large scale stress tests on codfw servers:

  • Monitor outgoing connections on a pooled eqiad app server for a minute, and also run a few test URLs ourselves during this (a rough monitoring sketch follows this list).
  • Audit each IP/hostname and verify whether we thought of it in our plan.
    • Make sure that those same connections are fine if happening from a codfw server with possibly lagged data from a local master.
    • Make sure each of those services has a codfw equivalent or is otherwise acceptable to happen cross data centre (e.g. statsd/graphite).
  • Monitor outgoing connections on a codfw app server and run the same few URLs once.
  • Audit each IP/hostname and verify that things we think use local codfw instances do in fact use local codfw instances.
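
As a rough sketch of the monitoring step above, assuming a Linux app server with iproute2's `ss` available; the sampling loop and output format are illustrative only:

```
#!/usr/bin/env python3
"""Sketch: sample established TCP connections for a minute and tally the remote hosts."""
import collections
import socket
import subprocess
import time

SAMPLE_SECONDS = 60                     # "for a minute", as in the plan above
remotes = collections.Counter()

deadline = time.time() + SAMPLE_SECONDS
while time.time() < deadline:
    # One established connection per line; the peer address:port is the last column.
    out = subprocess.check_output(["ss", "-tn", "state", "established"], text=True)
    for line in out.splitlines()[1:]:   # skip the header line
        peer = line.split()[-1].rsplit(":", 1)[0]
        try:
            host = socket.gethostbyaddr(peer)[0]
        except OSError:
            host = peer                 # keep the bare IP if reverse DNS fails
        remotes[host] += 1
    time.sleep(1)

for host, seen in remotes.most_common():
    print(f"{seen:6d}  {host}")
```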

Finished the audit of the network analysis. It was performed on mw1017 and mw2099, using test2.wikipedia.org to view Special:BlankPage and a random page, to view and change Special:Preferences, and to attempt to edit a page.

Details in the transclusions below:

{P2739}

In P2739#11615, @Krinkle wrote:

Outgoing to within Eqiad:

  • 239.128.0.112: Purge Varnish (HTCPMulticast)
  • argon: Send wiki events (IRC)
  • carbon: Send monitoring (Ganglia)
  • db10*, es10*: Read and write database queries (MySQL master/slave)
  • dns-rec-lb: DNS resolution (via LVS)
  • eventbus.svc: Send wiki events (http, EventBus)
  • eventlog1001: Send wiki events (udp, EventLogging)
  • fluorine: Application logs (udp2log)
  • graphite1001: Application analytics (statsd)
  • lithium: Application logs (syslog)
  • logstash10*: Application logs (elasticsearch)
  • mc10*: Read and write object caching (memcached)
  • neodymium: Cluster management (salt)
  • pc10*: Read and write parser cache (mysql)
  • rdb10*: Enqueue background jobs (redis)
  • tungsten: Application debug profiling (xhprof, xhgui, mongodb)

Incoming and responding within Eqiad:

  • hassaleh: Traffic proxy (debug_proxy)
  • text-lb: Traffic proxy (http, ssl; varnish, nginx)
  • neon: Send monitoring (icinga)
  • bast1001 - hooft.esams: (my ssh session)

{P2744}

In P2744#11616, @Krinkle wrote:

Outgoing to within Codfw:

  • db20*, es20*: Database queries (MySQL)
  • chromium: DNS
  • dns-rec-lb: DNS resolution (via LVS)
  • install2001: Send monitoring (Ganglia)
  • mc20*: Object cache (memcached)
  • pc20*: Parser cache (mysql)
  • suhail: Application lock management (poolcounter)
  • lvs2003: Traffic proxy category "low-traffic" (appservers, eventbus, ..)

Incoming and responding within Codfw:

  • hassaleh: Traffic proxy (debug_proxy)
  • text-lb: Traffic proxy (http, ssl; varnish, nginx)

Outgoing to within Eqiad:

  • fluorine: Application logs (udp2log)
  • graphite1001: Application analytics (statsd)
  • lithium: Application logs (syslog)
  • logstash10*: Application logs (elastic)
  • tungsten: Application debug profiling (xhprof, xhgui, mongodb)

Incoming and responding within Eqiad:

  • neon: Send monitoring (icinga)
  • hooft.esams: (my ssh session)

There don't seem to be any unexpected connections.

  • Codfw uses the right local services.
  • Codfw-to-Eqiad traffic is only logs and monitoring.

jcrespo triaged this task as High priority.
jcrespo moved this task from Triage to In progress on the DBA board.
jcrespo moved this task from Backlog to In Progress on the codfw-rollout-Jan-Mar-2016 board.

The following tests have been done:

  • Mass-request Special:BlankPage once per shard
  • Mass-request the Main page once per shard
  • Mass-request a non-existent page (404)

This was done in parallel from 5 mediawiki servers, with 100 simultaneous connections per server, for a total of 100000 requests over 300-600 seconds. The load generated was between 0.5x and 2x the load at eqiad.
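
For reference, a minimal sketch of what one of the five load drivers could look like, assuming the Python `requests` library; the target hostname, wiki, request counts and output format are illustrative, not the actual tooling used:

```
#!/usr/bin/env python3
"""Sketch: one driver's share of the load - 100 concurrent workers hitting one URL."""
import concurrent.futures
import time
import requests

URL = "http://mw2099.codfw.wmnet/wiki/Special:BlankPage"  # placeholder target
WIKI_HOST = "test2.wikipedia.org"                         # placeholder wiki
CONCURRENCY = 100          # 100 simultaneous connections per driver, as above
TOTAL_REQUESTS = 20000     # 5 drivers x 20000 = 100000 requests overall

def one_request(_):
    start = time.monotonic()
    try:
        requests.get(URL, headers={"Host": WIKI_HOST}, timeout=30)
    except requests.RequestException:
        pass                                       # count errors/timeouts as 30 s
    return (time.monotonic() - start) * 1000.0     # latency in milliseconds

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(TOTAL_REQUESTS)))

# Print a "percentage served within" table, as in the results below.
for pct in (50, 66, 75, 80, 90, 95, 98, 99, 100):
    idx = min(len(latencies) - 1, int(len(latencies) * pct / 100))
    print(f"{pct:3d}% {latencies[idx]:8.0f}")
```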

Typical response times look like this (percentage of the requests served within a certain time, in ms):

 50%    607
 66%    658
 75%    691
 80%    711
 90%    765
 95%    809
 98%    859
 99%    893
100%  21114

Across runs, the 50th percentile was 300-700 ms and the 99th percentile 500-900 ms, with a very large maximum request time. These results are not very useful without being able to compare them with eqiad, but that comparison is not possible while eqiad is serving production traffic.

Other tests done:

  • Requesting random enwiki page titles in 5 parallel threads
  • Requesting enwiki parsing (using api.php) in 5 parallel threads
  • Requesting the recentchanges special page for enwiki in 5 parallel threads

Parsing a page for the first time takes a variable amount of time (between 1 and 6 seconds). After letting the requests run for some time, the average request time drops to around 0.3-0.4 seconds (presumably because the results are already in the parsercache/memcached).
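
A minimal sketch of how such a cold-versus-warm parse comparison could be driven through api.php with 5 parallel threads; the endpoint and page titles are placeholders, and the actual tests hit codfw app servers directly rather than the public endpoint:

```
#!/usr/bin/env python3
"""Sketch: parse the same titles twice via api.php to compare cold vs warm timings."""
import concurrent.futures
import time
import requests

API = "https://en.wikipedia.org/w/api.php"      # placeholder endpoint
TITLES = ["Barack Obama", "Physics", "Japan",   # placeholder page titles
          "Europe", "Mathematics"]

def parse(title):
    start = time.monotonic()
    requests.get(API, params={"action": "parse", "page": title, "format": "json"},
                 timeout=60)
    return time.monotonic() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    cold = list(pool.map(parse, TITLES))   # first pass: may miss the parser cache
    warm = list(pool.map(parse, TITLES))   # second pass: should mostly hit the cache

for title, c, w in zip(TITLES, cold, warm):
    print(f"{title:20s} cold {c:5.2f}s  warm {w:5.2f}s")
```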

Although I have the full results for these tests, performance was not the goal here; catching regressions and load issues was. There was only one occasion on which errors appeared: when requesting the main page of meta for the first time, the first requests went over the predefined timeout of 30 seconds. So the conclusion from this preliminary testing is that all essential mediawiki read-only functions (reading all pages, using the API, using special pages, handling non-existent pages) work, but there could be some pile-ups due to several buffers being cold.

The databases, as predicted, seem to be provisioned generously enough to handle all requests, although they may still need some tweaks after the failover.

I am closing this task because I have done everything I wanted in order to test functionality. I have not tested anything beyond pure mediawiki and its memcache/parsercache/core DBs/external storage backends. Flow/x1, restbase, swift and elastic were outside the scope of this ticket. Feel free to reopen it if you want to do some more testing on your own.