- Make sure job queue processing switches over cleanly: T124673
- Likely won't rely on EventBus yet; DC fail-over for that is discussed in T127718: EventBus / Change Propagation DC Failover Scenario
- Check the status of metrics and log aggregation (graphite T127976, logstash T127977)
- Figure out a way to switch the master Action API accessed from various services between DCs (not always local): T125069
- DNS might be too slow / unpredictable
- Network layer feasibility unclear; LVS in its current configuration requires all hosts to be on the same network.
- Possibly, set up basic service discovery: have the various services set watches on etcd to update the Action API URL config variable in-memory. The question is whether hundreds of watches are realistic. If we use polling instead, it is unclear whether the effort gains us much over DNS.
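The polling variant could be sketched roughly as below. This is purely illustrative: the class and URL names are hypothetical, and the discovery backend is stubbed with a plain callable standing in for an etcd read, so nothing here reflects an actual implementation decision.

```python
import threading

class ActionApiConfig:
    """Holds the current Action API URL in memory, refreshed by polling a
    discovery backend (e.g. an etcd key) instead of keeping hundreds of
    watches open. The fetch function is injected, so the poller itself
    stays backend-agnostic."""

    def __init__(self, fetch_url, poll_interval=5.0):
        self._fetch_url = fetch_url      # callable returning the current URL
        self._interval = poll_interval
        self._lock = threading.Lock()
        self._url = fetch_url()          # initial value at startup

    @property
    def url(self):
        with self._lock:
            return self._url

    def poll_once(self):
        """Fetch the URL once and swap it in atomically if it changed."""
        new_url = self._fetch_url()
        with self._lock:
            if new_url != self._url:
                self._url = new_url

    def run(self, stop_event):
        """Background loop: poll until stop_event is set."""
        while not stop_event.wait(self._interval):
            self.poll_once()

# Simulated discovery backend: starts on eqiad, then an operator flips the key.
state = {"url": "https://api.svc.eqiad.wmnet/w/api.php"}
cfg = ActionApiConfig(lambda: state["url"], poll_interval=0.1)
state["url"] = "https://api.svc.codfw.wmnet/w/api.php"
cfg.poll_once()
print(cfg.url)   # → https://api.svc.codfw.wmnet/w/api.php
```

The trade-off against DNS is the same one noted above: polling adds moving parts, and its freshness is bounded by the poll interval just as DNS freshness is bounded by the TTL.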
Referenced tasks:
- T130370: restbase1007.eqiad.wmnet CPU temperature?
- T130254: Investigate recent OOM events on restbase2004
- T125069: Create a service location / discovery system for locating local/master resources easily across all WMF applications
- T124673: Figure out how to migrate the jobqueues
- T127976: Graphite DC fail-over / per-DC setup
- T127977: Logstash DC fail-over / per-DC setup
- T127718: EventBus / Change Propagation DC Failover Scenario
As for the timeline / switch-over strategy, we agreed in the coordination meeting on Feb 24th to something along these lines:
Switching to codfw
- Verify DNS switching / resolve MW API switching question: T125069
- Set up services in codfw to all talk to each other in codfw, but use the eqiad Action API.
- Test the system, incl. switch-over / service discovery.
- Switch traffic to codfw services before general database switch.
- Service discovery again: either config updates (MW, job queue, services) or DNS.
- Along with MW master switch, change Action API URL in codfw to point to local API cluster.
- Update Parsoid config to point to the codfw MW API and restart Parsoid.
- Update Flow / VRS config to point to codfw Parsoid (this will probably happen as part of the MediaWiki switch-over, so there may be nothing special to do here).
Other Parsoid clients that don't proxy through RESTBase:
- Parsoid visual diff testing (but, we can do this after everything is done, nothing critical)
- OCG -- need to check with Scott whether anything still talks to Parsoid directly.
- IEG Review / Scholarships code (from what I remember @bd808 telling me earlier, this is not critical either, and can be updated after everything else is switched over).
Updates on the timeline:
- The traffic switch-over for RB is likely to happen next week (14th - 18th). To be finalized in tomorrow's DC fail-over meeting.
- To prepare for this, we should finish testing RB and Parsoid in codfw this week.
The job queue will be ready in codfw by the time the MediaWiki fail-over happens. The timeline for this is quite tight, with extra hardware only scheduled to arrive next week.
To prepare for the MediaWiki master switch, we need to document the configs we need to update and the order of changes. Let's link this documentation from https://wikitech.wikimedia.org/wiki/Switch_Datacenter.
Parsoid has been deployed to eqiad and codfw for a while now. The codfw Parsoid cluster of course talks to the eqiad MW API. After every deploy, we run curl http://localhost:8000/_version, which returns the right git sha.
Just now, I verified on wtp2001.codfw.wmnet that curls to enwiki/Hospet and enwiki/Hospet?oldid=706315339 work as expected.
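The post-deploy version check could be scripted along these lines. This is a sketch, not an existing tool: the 'sha' field in the /_version response and the helper names are assumptions, and the HTTP call is stubbed so the example is self-contained.

```python
import json

def verify_deploy(fetch, expected_sha):
    """Check that the service reports the git sha we just deployed.
    `fetch` performs the HTTP GET (injected here so the check is easy to
    test); the response body is assumed to be JSON with a 'sha' field."""
    body = json.loads(fetch("http://localhost:8000/_version"))
    return body.get("sha") == expected_sha

# Stubbed response standing in for the real HTTP call to /_version.
fake = lambda url: json.dumps({"name": "parsoid", "sha": "deadbeef"})
print(verify_deploy(fake, "deadbeef"))   # True: deployed sha matches
print(verify_deploy(fake, "cafef00d"))   # False: mismatch, deploy suspect
```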
RESTBase in codfw is also working as expected. In addition to things that are already covered by monitoring, I checked no-cache processing of HTML.
I do not think RESTBase in codfw is configured to use codfw Parsoid yet, though. We should update this, and re-test no-cache handling before switching traffic.
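The no-cache re-test could look roughly like the sketch below: a Cache-Control: no-cache request forces a fresh render rather than stored HTML. The hostname, port, and URL layout are assumptions for illustration, and the HTTP client is injected and stubbed so the sketch runs standalone.

```python
def render_no_cache(session_get, title):
    """Request a page with Cache-Control: no-cache so RESTBase re-renders
    it via Parsoid instead of serving stored HTML. The URL below is a
    stand-in; session_get is an injected HTTP client returning
    (status, body)."""
    url = ("http://restbase.svc.codfw.wmnet:7231"
           "/en.wikipedia.org/v1/page/html/" + title)
    status, body = session_get(url, headers={"Cache-Control": "no-cache"})
    assert status == 200 and body, "no-cache re-render failed"
    return body

# Stub standing in for the real client; the live check would hit RESTBase.
stub = lambda url, headers=None: (200, "<html>...</html>")
html = render_no_cache(stub, "Hospet")
```

Running this against codfw RESTBase before and after pointing it at codfw Parsoid would confirm the full render path works intra-DC.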
We are planning a test for RESTBase, Parsoid and SC[AB] services this Thursday, 2016-03-17 lasting for three hours, from 10:00 UTC to 13:00 UTC. The idea is to have all of these services talk to each other intra-DC only, with the exception of MW, which the services will reach out to in eqiad. In practice this means that only live RB traffic will be diverted to codfw: storage requests, Parsoid transforms, Mathoid and Graphoid renders and external no-cache requests (if some happen).
Most things are ready: RESTBase and the SC[AB] services already talk to each other within the local DC only. The last step is to divert the Varnish back-ends for RESTBase, Citoid and CXServer to codfw.
Overall, the fail-over went smoothly, with no user-visible issues.
After switching job queue updates to codfw, there were, however, some issues:
- Cassandra on restbase2004 showed data corruption, which caused it to OOM on compaction: T130254: Investigate recent OOM events on restbase2004. This does not seem to be directly related to the DC switch-over; it appears to have been triggered by a periodic large compaction that happened to start around the time of the fail-over.
- Cassandra on 1007 crashed, possibly related to an ongoing repair / CPU overheating: T130370: restbase1007.eqiad.wmnet CPU temperature?