Services DC switch-over checklist / tracking task
Closed, ResolvedPublic

Description

  • Make sure job queue processing switches over cleanly: T124673
  • Check the status of metrics and log aggregation (graphite T127976, logstash T127977)
  • Figure out a way to switch the master Action API accessed from various services between DCs (not always local): T125069
    • DNS might be too slow / unpredictable
    • Network layer feasibility unclear; LVS in current configuration requires all hosts to be on the same network.
    • Possibly set up basic service discovery in the various services / set watches on etcd to update the Action API URL config variable in memory. The question is whether hundreds of watches are realistic. If we use polling instead, it is unclear whether the effort gains us much over DNS.
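
To make the polling option above concrete, here is a minimal Python sketch of an Action API URL held in memory and refreshed by polling. Everything named here is hypothetical (`ActionApiUrl`, the `fetch` callable, the interval); `fetch` stands in for whatever reads the value from etcd or another config store, and no real etcd client API is assumed.

```python
import threading
from typing import Callable


class ActionApiUrl:
    """Holds the current Action API URL in memory (hypothetical helper)."""

    def __init__(self, fetch: Callable[[], str], initial: str) -> None:
        self._fetch = fetch  # assumption: reads the URL from the config store
        self._url = initial
        self._lock = threading.Lock()

    def get(self) -> str:
        """Return the URL that requests should currently use."""
        with self._lock:
            return self._url

    def refresh(self) -> bool:
        """Poll the store once; return True if the URL changed.

        If the store is unreachable, the last known URL is kept.
        """
        try:
            new_url = self._fetch()
        except Exception:
            return False
        with self._lock:
            changed = new_url != self._url
            self._url = new_url
        return changed

    def poll_forever(self, interval_s: float, stop: threading.Event) -> None:
        """Refresh every interval_s seconds until stop is set."""
        while not stop.wait(interval_s):
            self.refresh()
```

With this shape, the worst-case propagation delay is one polling interval, which is much the same trade-off as a DNS TTL. A watch-based variant would call the same update logic from a change notification instead of `poll_forever`.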

Event Timeline

GWicke updated the task description.
GWicke edited subscribers, added: Eevans, mobrovac, Pchelolo and 5 others; removed: Aklapper.
GWicke renamed this task from Pre-DC switch-over Services checklist to Services Pre-DC switch-over checklist. Feb 24 2016, 4:12 PM
GWicke updated the task description.
GWicke updated the task description.

As for the timeline / switch-over strategy, we agreed in the coordination meeting on Feb 24th to something along these lines:

Switching to codfw

  • Verify DNS switching / resolve MW API switching question: T125069
  • Set up services in codfw to all talk to each other in codfw, but use the eqiad Action API.
  • Test the system, incl. switch-over / service discovery.
  • Switch traffic to codfw services before general database switch.
    • Service discovery, again. Config updates (MW, job queue, services) or DNS.
  • Along with MW master switch, change Action API URL in codfw to point to local API cluster.
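
The routing rule in the steps above can be sketched in Python as follows. This is only an illustration of the rule (peers are reached in the local DC, the Action API follows the MW master); the function name, the `SERVICES` tuple and the `action_api` key are hypothetical and do not mirror any real config format.

```python
from typing import Dict

# Illustrative subset of the services discussed in this task.
SERVICES = ("restbase", "parsoid")


def targets_for(dc: str, mw_master_dc: str) -> Dict[str, str]:
    """Return, for services running in `dc`, the DC each dependency lives in.

    Peer services are always reached in the local DC; the Action API is
    reached wherever the MediaWiki master currently is.
    """
    targets = {name: dc for name in SERVICES}
    targets["action_api"] = mw_master_dc
    return targets


# Before the MW master switch: codfw services use the eqiad Action API.
before = targets_for("codfw", mw_master_dc="eqiad")
# After the MW master switch: the Action API is local as well.
after = targets_for("codfw", mw_master_dc="codfw")
```
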
GWicke renamed this task from Services Pre-DC switch-over checklist to Services DC switch-over checklist / tracking task. Feb 24 2016, 8:45 PM

Other tasks:

  • Update Parsoid config to point to the codfw MW API, then restart Parsoid
  • Update Flow / VRS config to point to codfw Parsoid (this is probably going to happen as part of the mediawiki switchover so maybe nothing special to do here)

Other Parsoid clients that don't proxy through RESTBase:

  • Parsoid visual diff testing (but, we can do this after everything is done, nothing critical)
  • OCG -- need to check with Scott if there is anything that talks to Parsoid directly anymore
  • IEG Review / Scholarships code (but from what I remember @bd808 telling me earlier, this is not critical either, and can be updated after everything else is switched over).

Updates on the timeline:

  • The traffic switch-over for RB is likely to happen next week (14th - 18th). To be finalized in tomorrow's DC fail-over meeting.
  • To prepare for this, we should finish testing RB and Parsoid in codfw this week.

The job queue will be ready in codfw by the time the MediaWiki fail-over happens. The timeline for this is quite tight, with extra hardware only scheduled to arrive next week.

To prepare for the MediaWiki master switch, we need to document the configs we need to update and the order of changes. Let's link this documentation from https://wikitech.wikimedia.org/wiki/Switch_Datacenter.

Parsoid has been deployed to eqiad and codfw for a while now. The codfw Parsoid cluster of course talks to the eqiad MW API. After every deploy, we run curl http://localhost:8000/_version, which returns the right git sha.
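
For illustration, the post-deploy check could be scripted roughly like this in Python. This is a sketch, not the actual deploy tooling: the helper names are hypothetical, and the only assumption about the `/_version` response is that the deployed git sha appears somewhere in its body.

```python
import urllib.request


def version_matches(body: str, expected_sha: str) -> bool:
    """True if the expected git sha appears in the /_version response body."""
    expected = expected_sha.strip().lower()
    return bool(expected) and expected in body.lower()


def check_local_version(expected_sha: str,
                        url: str = "http://localhost:8000/_version",
                        timeout_s: float = 5.0) -> bool:
    """Fetch the version endpoint and compare it against the deployed sha."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return version_matches(body, expected_sha)
```

Running `check_local_version("<deployed sha>")` on each host after a deploy would flag hosts still serving an older build.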

Just now, I verified on wtp2001.codfw.net that curl requests for enwiki/Hospet and enwiki/Hospet?oldid=706315339 work as expected.

RESTBase in codfw is also working as expected. In addition to things that are already covered by monitoring, I checked no-cache processing of HTML.

I do not think RESTBase in codfw is configured to use codfw Parsoid yet, though. We should update this, and re-test no-cache handling before switching traffic.

It will after https://gerrit.wikimedia.org/r/#/c/275536/ is merged.

We are planning a test for RESTBase, Parsoid and SC[AB] services this Thursday, 2016-03-17, lasting three hours, from 10:00 UTC to 13:00 UTC. The idea is to have all of these services talk to each other intra-DC only, with the exception of MW, which the services will reach out to in eqiad. In practice this means that only live RB traffic will be diverted to codfw: storage requests, Parsoid transforms, Mathoid and Graphoid renders, and external no-cache requests (if any happen).

Most things are ready: RESTBase and SC[AB] services talk to each other using only the local DC. The last step is to divert the Varnish back-ends to codfw for RESTBase, Citoid and CXServer.

Change 277798 had a related patch set uploaded (by Giuseppe Lavagetto):
cache::text: route all restbase traffic to codfw

https://gerrit.wikimedia.org/r/277798

Change 277803 had a related patch set uploaded (by Giuseppe Lavagetto):
Use the local restbase cluster in codfw

https://gerrit.wikimedia.org/r/277803

Change 277804 had a related patch set uploaded (by Giuseppe Lavagetto):
Switch temporarily eqiad to use the codfw restbase cluster

https://gerrit.wikimedia.org/r/277804

Change 277798 merged by Ema:
cache::text: route all restbase traffic to codfw

https://gerrit.wikimedia.org/r/277798

Change 277803 merged by jenkins-bot:
Use the local restbase cluster in codfw

https://gerrit.wikimedia.org/r/277803

Change 277804 merged by Giuseppe Lavagetto:
Switch temporarily eqiad to use the codfw restbase cluster

https://gerrit.wikimedia.org/r/277804

Overall, the fail-over went smoothly, with no user-visible issues.

After switching job queue updates to codfw, there were, however, some issues:

GWicke lowered the priority of this task from High to Medium. Apr 25 2016, 11:44 PM
GWicke removed a project: Services-next.
mobrovac claimed this task.
mobrovac removed a project: Patch-For-Review.
mobrovac removed a subscriber: gerritbot.

Yeah.