
Stress-test mediawiki application servers at codfw (especially to figure out the db weights configuration) and basic buffer warming
Closed, Resolved · Public

Description

  • Create a script to curl random valid mediawiki URLs (a rough sketch follows below)
  • Configure db-codfw.php
  • Execute it from some time before the actual switchover
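
For illustration only, a minimal sketch of the first bullet, assuming the Python `requests` library; the wiki hostname and request count are placeholders, and this is not the script that was actually written:

```
#!/usr/bin/env python3
"""Sketch: hit random valid MediaWiki URLs (wiki and request count are placeholders)."""
import requests

WIKI = "https://test2.wikipedia.org"   # placeholder wiki to exercise
N_REQUESTS = 1000                      # placeholder request count

session = requests.Session()
for i in range(N_REQUESTS):
    # Special:Random redirects to an existing page, so every followed URL is valid.
    r = session.get(WIKI + "/wiki/Special:Random", timeout=30)
    if r.status_code != 200:
        print(i, r.status_code, r.url)
```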

Event Timeline

jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description.
jcrespo subscribed.
Restricted Application added subscribers: StudiesWorld, Aklapper.

Change 267659 had a related patch set uploaded (by Jcrespo):
Delete eqiad masters from codfw configuration and add db weights

https://gerrit.wikimedia.org/r/267659

Change 267659 merged by jenkins-bot:
Prepare db-codfw.php for a live deployment

https://gerrit.wikimedia.org/r/267659

The first phase will be a mass-request of Special:BlankPage on testwiki or test2wiki.
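
For reference, a minimal sketch of requesting Special:BlankPage on a single codfw backend, assuming the Python `requests` library; the backend hostname, wiki and batch size are placeholders, not the actual tooling used:

```
#!/usr/bin/env python3
"""Sketch: request Special:BlankPage on one codfw backend (names are placeholders)."""
import time
import requests

APPSERVER = "http://mw2099.codfw.wmnet"   # hypothetical codfw app server
WIKI_HOST = "test2.wikipedia.org"         # wiki selected via the Host header

for i in range(100):                      # small batch; the real runs were much larger
    start = time.monotonic()
    r = requests.get(APPSERVER + "/wiki/Special:BlankPage",
                     headers={"Host": WIKI_HOST}, timeout=30)
    print(f"{i:3d} HTTP {r.status_code} in {(time.monotonic() - start) * 1000:.0f} ms")
```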

I want to do the following before running large scale stress tests on codfw servers:

  • Monitor outgoing connections on a pooled eqiad app server for a minute, and also run a few test URLs ourselves during this (a rough monitoring sketch follows this list).
  • Audit each IP/hostname and verify whether we thought of it in our plan.
    • Make sure that those same connections are fine if happening from a codfw server with possibly lagged data from a local master.
    • Make sure each of those services has a codfw equivalent or is otherwise acceptable to happen cross data centre (e.g. statsd/graphite).
  • Monitor outgoing connections on a codfw app server and run the same few URLs once.
  • Audit each IP/hostname and verify that things we think use local codfw instances do in fact use local codfw instances.
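
As a rough sketch of the monitoring step above, assuming a Linux app server with iproute2's `ss` available; the sampling loop and output format are illustrative only:

```
#!/usr/bin/env python3
"""Sketch: sample established TCP connections for a minute and tally the remote hosts."""
import collections
import socket
import subprocess
import time

SAMPLE_SECONDS = 60                     # "for a minute", as in the plan above
remotes = collections.Counter()

deadline = time.time() + SAMPLE_SECONDS
while time.time() < deadline:
    # One established connection per line; the peer address:port is the last column.
    out = subprocess.check_output(["ss", "-tn", "state", "established"], text=True)
    for line in out.splitlines()[1:]:   # skip the header line
        peer = line.split()[-1].rsplit(":", 1)[0]
        try:
            host = socket.gethostbyaddr(peer)[0]
        except OSError:
            host = peer                 # keep the bare IP if reverse DNS fails
        remotes[host] += 1
    time.sleep(1)

for host, seen in remotes.most_common():
    print(f"{seen:6d}  {host}")
```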

Finished the audit of the network analysis. It was performed on mw1017 and mw2099, using test2.wikipedia.org to view Special:BlankPage and a random page, to view and change Special:Preferences, and to attempt to edit a page.

Details in the transclusions below:

{P2739}

In P2739#11615, @Krinkle wrote:

Outgoing to within Eqiad:

  • 239.128.0.112: Purge Varnish (HTCPMulticast)
  • argon: Send wiki events (IRC)
  • carbon: Send monitoring (Ganglia)
  • db10*, es10*: Read and write database queries (MySQL master/slave)
  • dns-rec-lb: DNS resolution (via LVS)
  • eventbus.svc: Send wiki events (http, EventBus)
  • eventlog1001: Send wiki events (udp, EventLogging)
  • fluorine: Application logs (udp2log)
  • graphite1001: Application analytics (statsd)
  • lithium: Application logs (syslog)
  • logstash10*: Application logs (elasticsearch)
  • mc10*: Read and write object caching (memcached)
  • neodymium: Cluster management (salt)
  • pc10*: Read and write parser cache (mysql)
  • rdb10*: Enqueue background jobs (redis)
  • tungsten: Application debug profiling (xhprof, xhgui, mongodb)

Incoming and responding within Eqiad:

  • hassaleh: Traffic proxy (debug_proxy)
  • text-lb: Traffic proxy (http, ssl; varnish, nginx)
  • neon: Send monitoring (icinga)
  • bast1001 - hooft.esams: (my ssh session)

{P2744}

In P2744#11616, @Krinkle wrote:

Outgoing to within Codfw:

  • db20*, es20*: Database queries (MySQL)
  • chromium: DNS
  • dns-rec-lb: DNS resolution (via LVS)
  • install2001: Send monitoring (Ganglia)
  • mc20*: Object cache (memcached)
  • pc20*: Parser cache (mysql)
  • suhail: Application lock management (poolcounter)
  • lvs2003: Traffic proxy category "low-traffic" (appservers, eventbus, ..)

Incoming and responding within Codfw:

  • hassaleh: Traffic proxy (debug_proxy)
  • text-lb: Traffic proxy (http, ssl; varnish, nginx)

Outgoing to within Eqiad:

  • fluorine: Application logs (udp2log)
  • graphite1001: Application analytics (statsd)
  • lithium: Application logs (syslog)
  • logstash10*: Application logs (elastic)
  • tungsten: Application debug profiling (xhprof, xhgui, mongodb)

Incoming and responding within Eqiad:

  • neon: Send monitoring (icinga)
  • hooft.esams: (my ssh session)

There don't seem to be any unexpected connections.

  • Codfw uses the right local services.
  • Codfw-to-Eqiad traffic is only logs and monitoring.

jcrespo triaged this task as High priority.
jcrespo moved this task from Triage to In progress on the DBA board.
jcrespo moved this task from Backlog to In Progress on the codfw-rollout-Jan-Mar-2016 board.

The following tests have been done:

  • Mass-request Special:BlankPage once per shard
  • Mass-request the Main page once per shard
  • Mass-request a non-existent page (404)

This was done in parallel from 5 mediawiki servers, with 100 simultaneous connections per server, for a total of 100000 requests over 300-600 seconds. The load generated was between 0.5x and 2x the load at eqiad.
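
For reference, a minimal sketch of what one of the five load drivers could look like, assuming the Python `requests` library; the target hostname, wiki, request counts and output format are illustrative, not the actual tooling used:

```
#!/usr/bin/env python3
"""Sketch: one driver's share of the load - 100 concurrent workers hitting one URL."""
import concurrent.futures
import time
import requests

URL = "http://mw2099.codfw.wmnet/wiki/Special:BlankPage"  # placeholder target
WIKI_HOST = "test2.wikipedia.org"                         # placeholder wiki
CONCURRENCY = 100          # 100 simultaneous connections per driver, as above
TOTAL_REQUESTS = 20000     # 5 drivers x 20000 = 100000 requests overall

def one_request(_):
    start = time.monotonic()
    try:
        requests.get(URL, headers={"Host": WIKI_HOST}, timeout=30)
    except requests.RequestException:
        pass                                       # count errors/timeouts as 30 s
    return (time.monotonic() - start) * 1000.0     # latency in milliseconds

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(TOTAL_REQUESTS)))

# Print a "percentage served within" table, as in the results below.
for pct in (50, 66, 75, 80, 90, 95, 98, 99, 100):
    idx = min(len(latencies) - 1, int(len(latencies) * pct / 100))
    print(f"{pct:3d}% {latencies[idx]:8.0f}")
```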

Typical response times look like this (percentage of the requests served within a certain time, in ms):

 50%    607
 66%    658
 75%    691
 80%    711
 90%    765
 95%    809
 98%    859
 99%    893
100%  21114

Across runs, the 50th percentile was 300-700 ms and the 99th percentile 500-900 ms, with a very large maximum request time. These results are not very useful without being able to compare them with eqiad, but that comparison is not possible while eqiad is serving production traffic.

Other tests done:

  • Requesting random enwiki page titles in 5 parallel threads
  • Requesting enwiki parsing (using api.php) in 5 parallel threads
  • Requesting the recentchanges special page for enwiki in 5 parallel threads

Parsing a page for the first time takes a variable amount of time (between 1 and 6 seconds). After letting the requests run for some time, the average request time drops to around 0.3-0.4 seconds (presumably because the results are already in the parsercache/memcached).
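
A minimal sketch of how such a cold-versus-warm parse comparison could be driven through api.php with 5 parallel threads; the endpoint and page titles are placeholders, and the actual tests hit codfw app servers directly rather than the public endpoint:

```
#!/usr/bin/env python3
"""Sketch: parse the same titles twice via api.php to compare cold vs warm timings."""
import concurrent.futures
import time
import requests

API = "https://en.wikipedia.org/w/api.php"      # placeholder endpoint
TITLES = ["Barack Obama", "Physics", "Japan",   # placeholder page titles
          "Europe", "Mathematics"]

def parse(title):
    start = time.monotonic()
    requests.get(API, params={"action": "parse", "page": title, "format": "json"},
                 timeout=60)
    return time.monotonic() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    cold = list(pool.map(parse, TITLES))   # first pass: may miss the parser cache
    warm = list(pool.map(parse, TITLES))   # second pass: should mostly hit the cache

for title, c, w in zip(TITLES, cold, warm):
    print(f"{title:20s} cold {c:5.2f}s  warm {w:5.2f}s")
```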

Although I have the full results for these tests, performance was not the goal here; catching regressions and load issues was. There was only one occasion on which errors appeared: when requesting the main page of meta for the first time, the first requests went over the predefined timeout of 30 seconds. So the conclusion from this preliminary testing is that all essential mediawiki read-only functions (reading all pages, using the API, using special pages, handling non-existent pages) work, but there could be some pile-ups due to several buffers being cold.

The databases, as predicted, seem to be provisioned generously enough to handle all requests, although they may still need some tweaks after the failover.

I am closing this task because I have done everything I wanted in order to test functionality. I have not tested anything beyond pure mediawiki and its memcache/parsercache/core DBs/external storage backends. Flow/x1, restbase, swift and elastic were outside the scope of this ticket. Feel free to reopen it if you want to do some more testing on your own.