- Create a script to curl random valid MediaWiki URLs
- Configure db-codfw.php
- Execute it for some time before the actual switchover
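A minimal sketch of such a script, assuming the MediaWiki random-page API endpoint; the wiki host, request count, and output format are illustrative assumptions, not the script actually used:

```shell
#!/bin/bash
# Hypothetical sketch only: host and iteration count are assumptions.
WIKI="https://test2.wikipedia.org"

for i in $(seq 1 100); do
    # Ask the MediaWiki API for one random article title
    title=$(curl -s "${WIKI}/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=1&format=json" \
        | python3 -c 'import sys, json; print(json.load(sys.stdin)["query"]["random"][0]["title"])')
    # Fetch the page itself, discarding the body and recording the HTTP status
    code=$(curl -s -o /dev/null -w '%{http_code}' \
        "${WIKI}/wiki/$(printf '%s' "$title" | tr ' ' '_')")
    echo "${code} ${title}"
done
```

Running this in a loop from several app servers before the switchover would exercise the configured db-codfw.php weights against realistic read traffic.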
| Status | Assignee | Task |
|--------|----------|------|
| Invalid | None | T125673 Switch over from Eqiad to Codfw as primary datacentre |
| Resolved | jcrespo | T124670 Figure out and document the datacenter switchover process |
| Resolved | Krinkle | T124671 Switchover of the application servers to codfw |
| Resolved | jcrespo | T124697 Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming |
| Resolved | jcrespo | T125386 Clarify how mysql dumps will be architectured during codfw failover |
| Resolved | jcrespo | T124795 codfw is in read only according to mediawiki |
I want to do the following before running large-scale stress tests on codfw servers:
- Monitor outgoing connections on a pooled eqiad app server for a minute. Also run a few test URLs ourselves during this.
- Audit each IP/hostname and verify whether we accounted for it in our plan.
- Make sure those same connections are fine when made from a codfw server with possibly lagged data from a local master.
- Make sure each of those services has a codfw equivalent or is otherwise acceptable cross-datacentre (e.g. statsd/graphite).
- Monitor outgoing connections on a codfw app server and run the same few URLs once.
- Audit each IP/hostname and verify that things we think use local codfw instances do in fact use local codfw instances.
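The connection-monitoring step above could be sketched like this, assuming `ss` and `getent` are available on the app server; the exact commands used are not recorded in this task:

```shell
# Sample established outgoing TCP connections once a second for 60 seconds,
# then summarize the distinct remote peer addresses seen.
for i in $(seq 1 60); do
    ss -tn state established | awk 'NR > 1 {print $5}'   # column 5 = peer ip:port
    sleep 1
done | sort | uniq -c | sort -rn > outgoing-connections.txt

# Resolve each remote IP to a hostname for the audit
awk '{print $2}' outgoing-connections.txt | cut -d: -f1 | sort -u \
    | while read -r ip; do
        printf '%s %s\n' "$ip" "$(getent hosts "$ip" | awk '{print $2}')"
      done
```

The resulting IP/hostname list is what gets audited against the switchover plan in the steps above.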
Finished the audit of the network analysis. It was performed on mw1017 and mw2099 against test2.wikipedia.org: viewing Special:BlankPage and a random page, viewing and changing Special:Preferences, and attempting to edit a page.
Details are in the transclusions below:
There don't seem to be any unexpected connections.
- Codfw uses the right local services.
- Codfw-to-eqiad traffic is only logs and monitoring.
The following tests have been done:
- Mass-request Special:BlankPage once per shard
- Mass-request the Main page once per shard
- Mass-request a non-existent page (404)
This was done in parallel from 5 MediaWiki servers, with 100 simultaneous connections per server: a total of 100,000 requests over 300-600 seconds. The generated load was between 0.5x and 2x the normal load at eqiad.
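The percentile report below resembles ApacheBench (`ab`) output, so the per-server load generation may have looked roughly like this; the service hostname and exact request split are assumptions for illustration:

```shell
# Hypothetical sketch: run from each of the 5 app servers in parallel
# (e.g. via a parallel-ssh tool). 100 concurrent connections and
# 20000 requests per server give 100000 requests in total.
ab -c 100 -n 20000 -H 'Host: test2.wikipedia.org' \
    'http://appservers.svc.codfw.wmnet/wiki/Special:BlankPage'
```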
Percentage of the requests served within a certain time (ms). Typical response times are similar to this:
| Percentile | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 100% |
|---|---|---|---|---|---|---|---|---|---|
| Time (ms) | 607 | 658 | 691 | 711 | 765 | 809 | 859 | 893 | 21114 |
Across runs, the 50th percentile was 300-700 ms and the 99th percentile 500-900 ms, with a very large maximum request time. These results are of limited use without a comparison against eqiad, but that comparison is not possible while eqiad is serving production traffic.
Other tests done:
- Requesting random enwiki page titles in 5 parallel threads
- Requesting enwiki parsing (using api.php) in 5 parallel threads
- Requesting recentchanges special page for enwiki in 5 parallel threads
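The 5-thread parallel request pattern can be approximated with `xargs -P`; the titles file and the exact API query here are illustrative assumptions:

```shell
# titles.txt: one enwiki page title per line (underscores for spaces).
# Run 5 parallel workers, each parsing one page at a time via api.php,
# printing the total request time for each title.
xargs -P 5 -I {} \
    curl -s -o /dev/null -w '%{time_total}s {}\n' \
    'https://en.wikipedia.org/w/api.php?action=parse&page={}&format=json' \
    < titles.txt
```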
First-time parses take a variable amount of time (between 1 and 6 seconds). After the requests run for a while, the average request time drops to around 0.3-0.4 seconds (presumably because the results are already in the parsercache/memcached).
Although I have the full results for these tests, the goal was not performance but catching regressions and load issues. Errors appeared on only one occasion: when requesting the main page of meta for the first time, the first requests exceeded the predefined timeout of 30 seconds. The conclusion from this preliminary testing is that all essential MediaWiki read-only functions (reading all pages, using the API, using special pages, requesting non-existent pages) work, but there could be some pile-ups while several buffers are cold.
The databases, as predicted, seem overprovisioned enough to handle all requests, although the configuration may still need some tweaks after failover.
I am closing this task because I have tested all the functionality I wanted to. I have not tested non-core MediaWiki services beyond memcached/parsercache/core DBs/external storage; Flow/x1, RESTBase, Swift, and Elasticsearch were outside the scope of this ticket. Feel free to reopen it if you want to do more testing on your own.