
Cassandra3 migration for Analytics AQS
Closed, Resolved (Public)

Description

Cassandra3 migration for Analytics AQS

Parent task.

Related Objects

Status     Assigned
Resolved   BTullis
Resolved   elukey
Duplicate  BTullis
Resolved   BTullis
Declined   hnowlan
Resolved   hnowlan
Declined   hnowlan
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   BTullis
Resolved   BTullis
Resolved   BTullis
Resolved   BTullis
Resolved   Jclark-ctr
Resolved   Eevans

Event Timeline

Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1010.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091058_hnowlan_9533.log.

Completed auto-reimage of hosts:

['aqs1010.eqiad.wmnet']

Of which those FAILED:

['aqs1010.eqiad.wmnet']

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1010.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091139_hnowlan_16200.log.

Completed auto-reimage of hosts:

['aqs1010.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1011.eqiad.wmnet', 'aqs1012.eqiad.wmnet', 'aqs1013.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091243_hnowlan_10621.log.

Completed auto-reimage of hosts:

['aqs1012.eqiad.wmnet']

Of which those FAILED:

['aqs1012.eqiad.wmnet']

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1012.eqiad.wmnet', 'aqs1014.eqiad.wmnet', 'aqs1015.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091411_hnowlan_3565.log.

Completed auto-reimage of hosts:

['aqs1012.eqiad.wmnet', 'aqs1014.eqiad.wmnet', 'aqs1015.eqiad.wmnet']

and were ALL successful.

Change 675174 had a related patch set uploaded (by Hnowlan; author: Hnowlan):
[labs/private@master] profile::aqs_next: add stub password

https://gerrit.wikimedia.org/r/675174

Change 675174 merged by Hnowlan:
[labs/private@master] profile::aqs_next: add stub password

https://gerrit.wikimedia.org/r/675174

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1012.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105180958_hnowlan_17932.log.

Completed auto-reimage of hosts:

['aqs1012.eqiad.wmnet']

and were ALL successful.

elukey removed elukey as the assignee of this task. Jun 1 2021, 8:08 AM
elukey subscribed.

Change 719233 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a prometheus scrape target for the aqs_next role

https://gerrit.wikimedia.org/r/719233

Change 719233 merged by Btullis:

[operations/puppet@production] Add a prometheus scrape target for the aqs_next role

https://gerrit.wikimedia.org/r/719233

Here is the migration plan document: https://docs.google.com/document/d/1FGub_rRIrv77Miadp0Muvf6EwpbvcW2dtZ_qICSt-2o/edit

The snapshot is copying to the new hosts now. We'll be able to keep an eye on progress by monitoring the usage of the /srv/cassandra-a and /srv/cassandra-b filesystems on these two graphs.

(screenshot: filesystem usage graphs for /srv/cassandra-a and /srv/cassandra-b)
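For anyone following along without access to the dashboards, the same figures can be checked directly on the destination hosts. A trivial sketch, not part of the documented procedure:

# Watch filesystem usage on a destination host while the snapshot copy runs.
watch -n 60 df -h /srv/cassandra-a /srv/cassandra-b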

Once the copy operation is complete, we will want to perform a validation of the contents of the directories, before restoring the snapshots.

The transfer operation of the snapshots completed successfully.

I have done some additional verification to make sure that the snapshots are consistent.

  • Taken a complete list of all files in each of the four snapshots, excluding the system tables, e.g.
find /srv/cassandra-b/data/local_*/*/snapshots/1631011478226 -type f > aqs1004-cassandra-b-1631011478226-files.txt
  • Created SHA1 checksums of 1,000 random files (about 1% of the files), e.g.
time nice sha1sum `shuf -n 1000 aqs1004-cassandra-b-1631011478226-files.txt` > aqs1004-cassandra-b-1631011478226-1000-random-files.chk
  • Modified the lists of files and checksums so that they match the flattened paths of the destination directories (see the worked example after this list), e.g.
cat aqs1004-cassandra-b-1631011478226-files.txt|sed 's/\//@/8; s/\//_/g; s/@/\//; s/1631011478226\//1631011478226\/1631011478226\//' > aqs1004-cassandra-b-1631011478226-files_modified.txt
cat aqs1004-cassandra-b-1631011478226-1000-random-files.chk|sed 's/\//@/8; s/\//_/g; s/@/\//; s/1631011478226\//1631011478226\/1631011478226\//' > aqs1004-cassandra-b-1631011478226-1000-random-files_modified.chk
  • Copied these files to the destination servers
  • Checked that every file in the snapshot (excluding the system tables) is present on the destination servers. e.g.
for i in $(cat aqs1004-cassandra-b-1631011478226-files_modified.txt); do if ! [ -f $i ]; then echo $i does not exist; fi ; done
  • Verified the checksums of each of the 1,000 randomly sampled files, e.g.
sha1sum --quiet -c aqs1004-cassandra-b-1631011478226-1000-random-files_modified.chk
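
To make the sed rewrite above easier to follow, here is the same expression applied to a single hypothetical source path (the keyspace, table directory and file names are made up):

# Illustration only: show how one source path is flattened. The sed keeps the
# 8th slash, turns every other slash into an underscore, and duplicates the
# snapshot-id component, matching the flattened destination layout described above.
echo '/srv/cassandra-b/data/local_group_default_T_example/data-0123abcd/snapshots/1631011478226/mc-100-big-Data.db' \
  | sed 's/\//@/8; s/\//_/g; s/@/\//; s/1631011478226\//1631011478226\/1631011478226\//'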

Given that these tests pass for all four of the snapshots, and that transfer.py is understood to have carried out its own checksumming during the copy, I'm happy that these are consistent and we can press on with importing the keyspaces at any time.
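The import procedure itself is covered by the migration plan linked above; purely as a hedged sketch, one common way to load snapshotted SSTables into a new cluster is sstableloader, where the contact point and directory below are placeholders rather than the actual plan:

# Hypothetical example only: stream the SSTables from a restored snapshot
# directory into the new cluster. The contact point and keyspace/table path
# are placeholders.
sstableloader -d aqs1010-a.eqiad.wmnet \
    /srv/cassandra-a/restore/local_group_default_T_example/data-0123abcd/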

The one thing that we have to keep an eye on is the capacity of the drives, given that a snapshot takes up ~60% of each of the two available file systems. Loading a full snapshot might take us up to 90% by my reckoning, so we may need to be careful about this.

Change 721849 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add temporary rsync modules to two Cassandra nodes

https://gerrit.wikimedia.org/r/721849

We have decided to use rsync for the next transfer from the v2 cluster to the v3 cluster.
As such, I'm proposing to create temporary rsync modules on the two destination hosts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/721849
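To give a flavour of what this enables, a minimal sketch of an rsync push from a v2 host into one of those modules; the module name and paths are assumptions, not the actual configuration in the patch:

# Push a snapshot directory from an old host to a temporary rsync module on a
# destination host (module name "cassandra-a" and all paths are hypothetical).
rsync -a --info=progress2 \
    /srv/cassandra-a/data/local_group_default_T_example/data-0123abcd/snapshots/1631011478226/ \
    rsync://aqs1010.eqiad.wmnet/cassandra-a/local_group_default_T_example/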

Mentioned in SAL (#wikimedia-analytics) [2021-09-17T15:15:01Z] <btullis> btullis@aqs1004:~$ sudo nodetool-a repair --full && sudo nodetool-b repair --full (T249755)

I think that we might need to remove those previously created snapshots, because the usage on aqs1004 and aqs1007 is above 92%.

(screenshot: disk usage graphs for aqs1004 and aqs1007)

Mentioned in SAL (#wikimedia-analytics) [2021-09-17T16:03:00Z] <btullis> Cleared all snapshots on aqs100[47] to reclaim space with nodetool-[ab] clearsnapshot (T249755)
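
For reference, the per-instance housekeeping behind that SAL entry looks roughly like the following; this is a sketch only, the exact invocations were not recorded here, and nodetool listsnapshots availability depends on the Cassandra version in use:

# On aqs1004 / aqs1007, per instance (nodetool-a / nodetool-b are the
# multi-instance wrappers used elsewhere in this task).
sudo nodetool-a listsnapshots   # show existing snapshots and the space they hold
sudo nodetool-a clearsnapshot   # remove all snapshots for this instance
sudo nodetool-b listsnapshots
sudo nodetool-b clearsnapshot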

Change 721849 merged by Btullis:

[operations/puppet@production] Add temporary rsync modules to two Cassandra nodes

https://gerrit.wikimedia.org/r/721849

BTullis moved this task from Backlog to Complete on the Cassandra board.

Ding ding! Resolved an epic.