Services DC switch-over checklist / tracking task
Closed, Resolved · Public

Assigned To

Authored By

	• GWicke
	Feb 24 2016, 3:53 PM

Description

Make sure job queue processing switches over cleanly: T124673
- Likely won't rely on EventBus yet; DC fail-over for that is discussed in T127718: EventBus / Change Propagation DC Failover Scenario
Check the status of metrics and log aggregation (graphite T127976, logstash T127977)
Figure out a way to switch the master Action API accessed from various services between DCs (not always local): T125069
- DNS might be too slow / unpredictable
- Network layer feasibility unclear; LVS in current configuration requires all hosts to be on the same network.
- Possibly, set up basic service discovery from various services / set watches on etcd to update the action API URL config variable in-memory. The question is whether hundreds of watches are realistic. If we use polling instead, it is unclear whether the effort gains us much over DNS.
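
To make the polling variant of the last option concrete, here is a minimal sketch. It is not an existing implementation: the `ActionApiLocator` class and the `fetch_url` callable are hypothetical names, and where the URL is actually published (etcd, a config file, something else) is deliberately left open.

```python
import threading
from typing import Callable


class ActionApiLocator:
    """Holds the current Action API URL in memory and refreshes it by polling."""

    def __init__(self, fetch_url: Callable[[], str], initial_url: str) -> None:
        # fetch_url is a hypothetical callable that returns the URL of the
        # current master Action API from wherever it is published.
        self._fetch_url = fetch_url
        self._url = initial_url
        self._lock = threading.Lock()

    @property
    def url(self) -> str:
        """Return the last known Action API URL."""
        with self._lock:
            return self._url

    def refresh(self) -> bool:
        """Poll once; return True if the stored URL changed."""
        try:
            new_url = self._fetch_url()
        except Exception:
            # Lookup failed: keep serving the last known URL.
            return False
        with self._lock:
            changed = new_url != self._url
            self._url = new_url
        return changed
```

A service would call `refresh()` on a timer and read `locator.url` per request. The polling interval bounds how stale the URL can get, which is the same trade-off as a DNS TTL, so this mainly illustrates why the gain over DNS is uncertain.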

Details

Subject	Repo	Branch	Lines +/-
cache::text: route all restbase traffic to codfw	operations/puppet	production	+1 -1
Switch temporarily eqiad to use the codfw restbase cluster	operations/mediawiki-config	master	+1 -1
Use the local restbase cluster in codfw	operations/mediawiki-config	master	+2 -2

Related Objects

Mentioned In: T126934: Roll out Restbase to production
Mentioned Here: T130370: restbase1007.eqiad.wmnet CPU temperature?
T130254: Investigate recent OOM events on restbase2004
T125069: Create a service location / discovery system for locating local/master resources easily across all WMF applications
T124673: Figure out how to migrate the jobqueues
T127976: Graphite DC fail-over / per-DC setup
T127977: Logstash DC fail-over / per-DC setup
T127718: EventBus / Change Propagation DC Failover Scenario

Event Timeline

• GWicke created this task. · Feb 24 2016, 3:53 PM

Restricted Application added subscribers: StudiesWorld, Aklapper. · Feb 24 2016, 3:53 PM

• GWicke added projects: Services-next, codfw-rollout-Jan-Mar-2016, Services. · Feb 24 2016, 3:54 PM

Restricted Application added a project: codfw-rollout. · Feb 24 2016, 3:54 PM

• GWicke triaged this task as High priority. · Feb 24 2016, 3:55 PM

• GWicke updated the task description.

• GWicke edited subscribers, added: Eevans, • mobrovac, • Pchelolo and 5 others; removed: Aklapper.

• GWicke updated the task description. · Feb 24 2016, 4:04 PM

• GWicke renamed this task from Pre-DC switch-over Services checklist to Services Pre-DC switch-over checklist. · Feb 24 2016, 4:12 PM

• GWicke updated the task description.

As for the timeline / switch-over strategy, we agreed in the coordination meeting on Feb 24th on something along these lines:

Switching to codfw

Verify DNS switching / resolve MW API switching question: T125069
Set up services in codfw to all talk to each other in codfw, but use the eqiad Action API.
Test the system, incl. switch-over / service discovery.
Switch traffic to codfw services before general database switch.
- Service discovery, again. Config updates (MW, job queue, services) or DNS.
Along with MW master switch, change Action API URL in codfw to point to local API cluster.
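
The routing rule implied by the steps above can be sketched as follows. This is only an illustration: `Endpoints`, `endpoints_for` and `KNOWN_DCS` are hypothetical names, not part of any service's real configuration.

```python
from dataclasses import dataclass

# Assumption: only the two data centers discussed in this task are relevant.
KNOWN_DCS = ("eqiad", "codfw")


@dataclass(frozen=True)
class Endpoints:
    services_dc: str    # DC whose services (RESTBase, Parsoid, ...) are used
    action_api_dc: str  # DC whose Action API is used


def endpoints_for(local_dc: str, mw_master_dc: str) -> Endpoints:
    """Services always talk to their local DC; the Action API follows the MW master."""
    for dc in (local_dc, mw_master_dc):
        if dc not in KNOWN_DCS:
            raise ValueError(f"unknown data center: {dc}")
    return Endpoints(services_dc=local_dc, action_api_dc=mw_master_dc)
```

Before the MW master switch, `endpoints_for("codfw", "eqiad")` gives codfw services using the eqiad Action API. After the switch, `endpoints_for("codfw", "codfw")` makes everything local.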

• GWicke renamed this task from Services Pre-DC switch-over checklist to Services DC switch-over checklist / tracking task. · Feb 24 2016, 8:45 PM

Other tasks:

Update Parsoid config to point to codfw m/w api + restart Parsoid
Update Flow / VRS config to point to codfw Parsoid (this is probably going to happen as part of the mediawiki switchover so maybe nothing special to do here)

Other Parsoid clients that don't proxy through RESTBase:

Parsoid visual diff testing (but, we can do this after everything is done, nothing critical)
OCG -- need to check with Scott if there is anything that talks to Parsoid directly anymore
IEG Review / Scholarships code (but from what I remember @bd808 telling me earlier, this is not critical either, and can be updated after everything else is switched over).

Updates on the timeline:

The traffic switch-over for RB is likely to happen next week (14th - 18th). To be finalized in tomorrow's DC fail-over meeting.
To prepare for this, we should finish testing RB and Parsoid in codfw this week.

The job queue will be ready in codfw by the time the MediaWiki fail-over happens. The timeline for this is quite tight, with extra hardware only scheduled to arrive next week.

To prepare for the MediaWiki master switch, we need to document the configs we need to update and the order of changes. Let's link this documentation from https://wikitech.wikimedia.org/wiki/Switch_Datacenter.

Parsoid has been deployed to eqiad and codfw for a while now. The codfw Parsoid cluster of course talks to the eqiad MW API. After every deploy, we run curl http://localhost:8000/_version and check that it returns the right git sha.

Just now, I verified on wtp2001.codfw.net that curl requests for enwiki/Hospet and enwiki/Hospet?oldid=706315339 work as expected.
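
For reference, the post-deploy version check could be scripted roughly like this. It is a sketch, not the actual deploy tooling: `fetch_body` and `version_matches` are hypothetical helpers, and since the exact response format of `/_version` is not described here, the check only looks for the expected sha anywhere in the response body.

```python
import urllib.request
from typing import Callable


def fetch_body(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def version_matches(
    expected_sha: str,
    fetch: Callable[[str], str] = fetch_body,
    url: str = "http://localhost:8000/_version",
) -> bool:
    """Return True if the deployed version endpoint reports the expected sha."""
    if not expected_sha:
        # An empty sha would match any body, so treat it as a failure.
        return False
    return expected_sha in fetch(url)
```

Passing `fetch` as a parameter keeps the check usable without network access, for example in a dry run against a saved response.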

RESTBase in codfw is also working as expected. In addition to things that are already covered by monitoring, I checked no-cache processing of HTML.

I do not think RESTBase in codfw is configured to use codfw Parsoid yet, though. We should update this, and re-test no-cache handling before switching traffic.

In T127974#2103669, @GWicke wrote:

I do not think RESTBase in codfw is configured to use codfw Parsoid yet, though. We should update this, and re-test no-cache handling before switching traffic.

It will after https://gerrit.wikimedia.org/r/#/c/275536/ is merged.

• GWicke mentioned this in T126934: Roll out Restbase to production. · Mar 9 2016, 6:44 PM

• Mholloway subscribed. · Mar 9 2016, 7:10 PM

We are planning a test for RESTBase, Parsoid and SC[AB] services this Thursday, 2016-03-17, lasting three hours, from 10:00 UTC to 13:00 UTC. The idea is to have all of these services talk to each other intra-DC only, with the exception of MW, which the services will reach in eqiad. In practice this means that only live RB traffic will be diverted to codfw: storage requests, Parsoid transforms, Mathoid and Graphoid renders and external no-cache requests (if any happen).

Most things are ready: RESTBase and SC[AB] services talk to each other only within the local DC. The last step is to divert the Varnish back-ends to codfw for RESTBase, Citoid and CXServer.

Change 277798 had a related patch set uploaded (by Giuseppe Lavagetto):
cache::text: route all restbase traffic to codfw

https://gerrit.wikimedia.org/r/277798

gerritbot added a project: Patch-For-Review. · Mar 16 2016, 4:44 PM

Change 277803 had a related patch set uploaded (by Giuseppe Lavagetto):
Use the local restbase cluster in codfw

https://gerrit.wikimedia.org/r/277803

Change 277804 had a related patch set uploaded (by Giuseppe Lavagetto):
Switch temporarily eqiad to use the codfw restbase cluster

https://gerrit.wikimedia.org/r/277804

Change 277798 merged by Ema:
cache::text: route all restbase traffic to codfw

https://gerrit.wikimedia.org/r/277798

Change 277803 merged by jenkins-bot:
Use the local restbase cluster in codfw

https://gerrit.wikimedia.org/r/277803

Change 277804 merged by Giuseppe Lavagetto:
Switch temporarily eqiad to use the codfw restbase cluster

https://gerrit.wikimedia.org/r/277804

Overall, the fail-over went smoothly, with no user-visible issues.

After switching job queue updates to codfw, there were, however, some issues:

Cassandra on restbase2004 showed corruption, which caused it to OOM on compaction: T130254: Investigate recent OOM events on restbase2004. This does not seem to be directly related to the DC switch-over; it appears to have been triggered by a periodic large compaction that happened to start around the time of the DC fail-over.
Cassandra on restbase1007 crashed, possibly related to an ongoing repair / CPU overheating: T130370: restbase1007.eqiad.wmnet CPU temperature?

Krinkle moved this task from Backlog to In Progress on the codfw-rollout-Jan-Mar-2016 board. · Apr 21 2016, 3:02 PM

Krinkle moved this task from In Progress to Done on the codfw-rollout-Jan-Mar-2016 board.

• GWicke lowered the priority of this task from High to Medium. · Apr 25 2016, 11:44 PM

• GWicke removed a project: Services-next.

Should we resolve this task?

Yeah.
