
Reinstall and data reload of WDQS servers
Closed, Resolved (Public)

Description

As part of T123565 we need to do a data reload to re-index data for Geosearch. At the same time, we will do a full reinstall of wdqs1001 to enable use of new disk space.

Event Timeline

Planned sequence:

  1. (day before) Send email to the wikidata list
  2. Take wdqs1001 out of varnish config
  3. Shut down and reimage wdqs1001. Verify disk partitioning is correct.
  4. Deploy new code from wdq-deploy repo. Do NOT restart wdqs1002 yet!
  5. Reload data to wdqs1001 from the ttl-gz version of the https://dumps.wikimedia.org/wikidatawiki/entities/20160425/ dump (should be ready by then)
  6. Start updater on wdqs1001 and wait for it to catch up
  7. Re-add wdqs1001 to varnish, verify it's ready to serve requests
  8. Disable updater on wdqs1002
  9. Put wdqs1002 into maintenance mode (no need to take it out of varnish as we are only reloading data, not reimaging)
  10. Reload wdqs1002 data from the same dump as above.
  11. Re-enable updater on wdqs1002 and wait until it catches up
  12. Remove maintenance mode from wdqs1002
  13. Verify everything works fine and queries run on both servers
  14. Send the victory email to wikidata
  15. PROFIT!
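
The ordering in the sequence above looks designed so that one of the two nodes keeps answering queries at every step: wdqs1001 goes back into varnish before wdqs1002 enters maintenance mode. A minimal sketch of that constraint as a small simulation follows. It is not the tooling used for this task: the `Node` class, its fields, `set_state`, `run_plan` and `PLAN` are hypothetical names made up for illustration, and treating maintenance mode as "not answering queries" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy model of one WDQS host (all field names are illustrative)."""
    name: str
    pooled: bool = True        # listed as a backend in the varnish config
    maintenance: bool = False  # assumed: maintenance mode stops query serving
    updater: bool = True       # updater process running

    @property
    def serving(self) -> bool:
        return self.pooled and not self.maintenance


def make_nodes():
    """Both hosts start pooled, out of maintenance, with the updater running."""
    return {name: Node(name) for name in ("wdqs1001", "wdqs1002")}


def set_state(name, **changes):
    """Build a step that sets the given fields on one node (no-op if empty)."""
    def action(nodes):
        for key, value in changes.items():
            setattr(nodes[name], key, value)
    return action


def run_plan(nodes, steps):
    """Apply steps in order; raise if a step leaves no node serving."""
    log = []
    for description, action in steps:
        action(nodes)
        up = [n.name for n in nodes.values() if n.serving]
        if not up:
            raise RuntimeError(f"no node serving after: {description}")
        log.append((description, up))
    return log


# State-changing steps of the planned sequence (emails and checks omitted).
PLAN = [
    ("take wdqs1001 out of varnish", set_state("wdqs1001", pooled=False, updater=False)),
    ("reimage wdqs1001 and reload data", set_state("wdqs1001")),
    ("start updater on wdqs1001", set_state("wdqs1001", updater=True)),
    ("re-add wdqs1001 to varnish", set_state("wdqs1001", pooled=True)),
    ("disable updater on wdqs1002", set_state("wdqs1002", updater=False)),
    ("put wdqs1002 into maintenance mode", set_state("wdqs1002", maintenance=True)),
    ("reload data on wdqs1002", set_state("wdqs1002")),
    ("re-enable updater on wdqs1002", set_state("wdqs1002", updater=True)),
    ("remove maintenance mode from wdqs1002", set_state("wdqs1002", maintenance=False)),
]

if __name__ == "__main__":
    for description, up in run_plan(make_nodes(), PLAN):
        print(f"{description}: serving = {', '.join(up)}")
```

Swapping two steps, for example putting wdqs1002 into maintenance before wdqs1001 is re-added to varnish, makes `run_plan` raise, which is the failure the ordering avoids.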

Change 285345 had a related patch set uploaded (by Gehel):
Depooled wdqs1001 during reinstall

https://gerrit.wikimedia.org/r/285345

Change 285345 merged by Gehel:
Depooled wdqs1001 during reinstall

https://gerrit.wikimedia.org/r/285345

Change 285353 had a related patch set uploaded (by Gehel):
Modify partitions to reflect new disk added in WDQS nodes

https://gerrit.wikimedia.org/r/285353

Change 285353 merged by Gehel:
Modify partitions to reflect new disk added in WDQS nodes

https://gerrit.wikimedia.org/r/285353

Planned sequence:

  1. (day before) Send email to the wikidata list
  2. Take wdqs1001 out of varnish config
  3. (IN PROGRESS) Shut down and reimage wdqs1001. Verify disk partitioning is correct.
  4. Deploy new code from wdq-deploy repo. Do NOT restart wdqs1002 yet! (should be done automatically as part of the reimage)
  5. Reload data to wdqs1001 from the ttl-gz version of the https://dumps.wikimedia.org/wikidatawiki/entities/20160425/ dump (should be ready by then)
  6. Start updater on wdqs1001 and wait for it to catch up
  7. Re-add wdqs1001 to varnish, verify it's ready to serve requests
  8. Disable updater on wdqs1002
  9. Put wdqs1002 into maintenance mode (no need to take it out of varnish as we are only reloading data, not reimaging)
  10. Reload wdqs1002 data from the same dump as above.
  11. Re-enable updater on wdqs1002 and wait until it catches up
  12. Remove maintenance mode from wdqs1002
  13. Verify everything works fine and queries run on both servers
  14. Send the victory email to wikidata
  15. PROFIT!

Mentioned in SAL [2016-04-26T09:50:10Z] <gehel> starting reinstall of wdqs1001 (T133566)

While rebuilding the RAID to add the new disks, I realized wdqs1001 has 2x 300GB + 2x 150GB disks. I'm reinstalling anyway to ensure we don't run on a single node, but this does not match what was planned in T119579 / T120712. I'll check with @RobH and/or @Cmjohnson when they arrive.

Change 285387 had a related patch set uploaded (by Gehel):
WDQS - Smaller /var/lib/wdqs partition

https://gerrit.wikimedia.org/r/285387

Change 285387 merged by Gehel:
WDQS - Smaller /var/lib/wdqs partition

https://gerrit.wikimedia.org/r/285387

Change 285716 had a related patch set uploaded (by Gehel):
Revert "Depooled wdqs1001 during reinstall"

https://gerrit.wikimedia.org/r/285716

Change 285716 merged by Gehel:
Revert "Depooled wdqs1001 during reinstall"

https://gerrit.wikimedia.org/r/285716

Mentioned in SAL [2016-04-27T20:09:09Z] <gehel> adding back wdqs1001 to varnish configuration after reinstall (T133566)

Mentioned in SAL [2016-04-27T20:32:13Z] <gehel> switching wdqs1002 to maintenance and reimporting data (T133566)

Mentioned in SAL [2016-04-28T14:32:14Z] <gehel> wdqs-updater started on wdqs1002 (T133566)
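
For a rough idea of how long each phase took, the SAL timestamps above can be subtracted from one another. The sketch below copies the four timestamps from the SAL entries; the helper names `parse` and `elapsed` are made up for illustration. The log does not record when the updater finished catching up on either host, so these intervals cover only what was logged.

```python
from datetime import datetime, timedelta, timezone

# Timestamps copied from the SAL entries on this task (all UTC).
REINSTALL_START = "2016-04-26T09:50:10Z"   # starting reinstall of wdqs1001
REPOOLED = "2016-04-27T20:09:09Z"          # wdqs1001 added back to varnish
MAINTENANCE_START = "2016-04-27T20:32:13Z" # wdqs1002 to maintenance, reimport
UPDATER_STARTED = "2016-04-28T14:32:14Z"   # wdqs-updater started on wdqs1002


def parse(timestamp: str) -> datetime:
    """Parse a SAL-style UTC timestamp such as 2016-04-26T09:50:10Z."""
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def elapsed(start: str, end: str) -> timedelta:
    """Time between two SAL timestamps."""
    return parse(end) - parse(start)


if __name__ == "__main__":
    print("wdqs1001 reinstall to re-pool:", elapsed(REINSTALL_START, REPOOLED))
    print("wdqs1002 reimport to updater start:", elapsed(MAINTENANCE_START, UPDATER_STARTED))
```

By this measure wdqs1001 was out of varnish for a little over 34 hours, and the wdqs1002 reimport took about 18 hours before its updater was started.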