
Cassandra3 migration for Analytics AQS
Closed, Resolved (Public)

Description

Cassandra3 migration for Analytics AQS

Parent task.

Related Objects

Status     Assigned
Resolved   BTullis
Resolved   elukey
Duplicate  BTullis
Resolved   BTullis
Declined   hnowlan
Resolved   hnowlan
Declined   hnowlan
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   JAllemandou
Resolved   BTullis
Resolved   BTullis
Resolved   BTullis
Resolved   BTullis
Resolved   Jclark-ctr
Resolved   Eevans

Event Timeline

Milimetric triaged this task as High priority.
Milimetric moved this task from Incoming to Operational Excellence on the Analytics board.

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1010.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091058_hnowlan_9533.log.

Completed auto-reimage of hosts:

['aqs1010.eqiad.wmnet']

Of which those FAILED:

['aqs1010.eqiad.wmnet']

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1010.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091139_hnowlan_16200.log.

Completed auto-reimage of hosts:

['aqs1010.eqiad.wmnet']

and were ALL successful.

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1011.eqiad.wmnet', 'aqs1012.eqiad.wmnet', 'aqs1013.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091243_hnowlan_10621.log.

Completed auto-reimage of hosts:

['aqs1012.eqiad.wmnet']

Of which those FAILED:

['aqs1012.eqiad.wmnet']

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1012.eqiad.wmnet', 'aqs1014.eqiad.wmnet', 'aqs1015.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202103091411_hnowlan_3565.log.

Completed auto-reimage of hosts:

['aqs1012.eqiad.wmnet', 'aqs1014.eqiad.wmnet', 'aqs1015.eqiad.wmnet']

and were ALL successful.

Change 675174 had a related patch set uploaded (by Hnowlan; author: Hnowlan):
[labs/private@master] profile::aqs_next: add stub password

https://gerrit.wikimedia.org/r/675174

Change 675174 merged by Hnowlan:
[labs/private@master] profile::aqs_next: add stub password

https://gerrit.wikimedia.org/r/675174

Script wmf-auto-reimage was launched by hnowlan on cumin1001.eqiad.wmnet for hosts:

['aqs1012.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202105180958_hnowlan_17932.log.

Completed auto-reimage of hosts:

['aqs1012.eqiad.wmnet']

and were ALL successful.

elukey removed elukey as the assignee of this task. Jun 1 2021, 8:08 AM
elukey subscribed.

Change 719233 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a prometheus scrape target for the aqs_next role

https://gerrit.wikimedia.org/r/719233

Change 719233 merged by Btullis:

[operations/puppet@production] Add a prometheus scrape target for the aqs_next role

https://gerrit.wikimedia.org/r/719233

Here is the migration plan document: https://docs.google.com/document/d/1FGub_rRIrv77Miadp0Muvf6EwpbvcW2dtZ_qICSt-2o/edit

The snapshot is copying to the new hosts now. We'll be able to keep an eye on progress by monitoring the usage of the /srv/cassandra-a and /srv/cassandra-b filesystems on these two graphs.

(screenshot: filesystem usage graphs for /srv/cassandra-a and /srv/cassandra-b)
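For anyone following along without access to the dashboards, the same figures can be checked directly on the destination hosts. A trivial sketch, not part of the documented procedure:

# Watch filesystem usage on a destination host while the snapshot copy runs.
watch -n 60 df -h /srv/cassandra-a /srv/cassandra-b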

Once the copy operation is complete, we will want to perform a validation of the contents of the directories, before restoring the snapshots.

The transfer operation of the snapshots completed successfully.

I have done some additional verification to make sure that the snapshots are consistent.

  • Taken a complete list of all files in each of the four snapshots, excluding the system tables, e.g.
find /srv/cassandra-b/data/local_*/*/snapshots/1631011478226 -type f > aqs1004-cassandra-b-1631011478226-files.txt
  • Created SHA1 checksums of 1,000 random files (about 1% of the files), e.g.
time nice sha1sum `shuf -n 1000 aqs1004-cassandra-b-1631011478226-files.txt` > aqs1004-cassandra-b-1631011478226-1000-random-files.chk
  • Modified the lists of files and checksums so that they match the flattened paths of the destination directories (see the worked example after this list), e.g.
cat aqs1004-cassandra-b-1631011478226-files.txt|sed 's/\//@/8; s/\//_/g; s/@/\//; s/1631011478226\//1631011478226\/1631011478226\//' > aqs1004-cassandra-b-1631011478226-files_modified.txt
cat aqs1004-cassandra-b-1631011478226-1000-random-files.chk|sed 's/\//@/8; s/\//_/g; s/@/\//; s/1631011478226\//1631011478226\/1631011478226\//' > aqs1004-cassandra-b-1631011478226-1000-random-files_modified.chk
  • Copied these files to the destination servers
  • Checked that every file in the snapshot (excluding the system tables) is present on the destination servers. e.g.
for i in $(cat aqs1004-cassandra-b-1631011478226-files_modified.txt); do if ! [ -f $i ]; then echo $i does not exist; fi ; done
  • Verified the checksums of each of the 1,000 randomly sampled files, e.g.
sha1sum --quiet -c aqs1004-cassandra-b-1631011478226-1000-random-files_modified.chk
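
To make the sed rewrite above easier to follow, here is the same expression applied to a single hypothetical source path (the keyspace, table directory and file names are made up):

# Illustration only: show how one source path is flattened. The sed keeps the
# 8th slash, turns every other slash into an underscore, and duplicates the
# snapshot-id component, matching the flattened destination layout described above.
echo '/srv/cassandra-b/data/local_group_default_T_example/data-0123abcd/snapshots/1631011478226/mc-100-big-Data.db' \
  | sed 's/\//@/8; s/\//_/g; s/@/\//; s/1631011478226\//1631011478226\/1631011478226\//'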

Given that these tests pass for all four of the snapshots, and that transfer.py is understood to have carried out its own checksumming during the copy, I'm happy that these are consistent and we can press on with importing the keyspaces at any time.
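The import procedure itself is covered by the migration plan linked above; purely as a hedged sketch, one common way to load snapshotted SSTables into a new cluster is sstableloader, where the contact point and directory below are placeholders rather than the actual plan:

# Hypothetical example only: stream the SSTables from a restored snapshot
# directory into the new cluster. The contact point and keyspace/table path
# are placeholders.
sstableloader -d aqs1010-a.eqiad.wmnet \
    /srv/cassandra-a/restore/local_group_default_T_example/data-0123abcd/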

The one thing that we have to keep an eye on is the capacity of the drives, given that a snapshot takes up ~60% of each of the two available file systems. Loading a full snapshot might take us up to 90% by my reckoning, so we may need to be careful about this.

Change 721849 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add temporary rsync modules to two Cassandra nodes

https://gerrit.wikimedia.org/r/721849

We have decided to use rsync for the next transfer from the v2 cluster to the v3 cluster.
As such, I'm proposing to create temporary rsync modules on the two destination hosts: https://gerrit.wikimedia.org/r/c/operations/puppet/+/721849
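To give a flavour of what this enables, a minimal sketch of an rsync push from a v2 host into one of those modules; the module name and paths are assumptions, not the actual configuration in the patch:

# Push a snapshot directory from an old host to a temporary rsync module on a
# destination host (module name "cassandra-a" and all paths are hypothetical).
rsync -a --info=progress2 \
    /srv/cassandra-a/data/local_group_default_T_example/data-0123abcd/snapshots/1631011478226/ \
    rsync://aqs1010.eqiad.wmnet/cassandra-a/local_group_default_T_example/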

Mentioned in SAL (#wikimedia-analytics) [2021-09-17T15:15:01Z] <btullis> btullis@aqs1004:~$ sudo nodetool-a repair --full && sudo nodetool-b repair --full (T249755)

I think that we might need to remove those previously created snapshots, because the usage on aqs1004 and aqs1007 is above 92%.

(screenshot: disk usage graphs for aqs1004 and aqs1007)

Mentioned in SAL (#wikimedia-analytics) [2021-09-17T16:03:00Z] <btullis> Cleared all snapshots on aqs100[47] to reclaim space with nodetool-[ab] clearsnapshot (T249755)
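
For reference, the per-instance housekeeping behind that SAL entry looks roughly like the following; this is a sketch only, the exact invocations were not recorded here, and nodetool listsnapshots availability depends on the Cassandra version in use:

# On aqs1004 / aqs1007, per instance (nodetool-a / nodetool-b are the
# multi-instance wrappers used elsewhere in this task).
sudo nodetool-a listsnapshots   # show existing snapshots and the space they hold
sudo nodetool-a clearsnapshot   # remove all snapshots for this instance
sudo nodetool-b listsnapshots
sudo nodetool-b clearsnapshot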

Change 721849 merged by Btullis:

[operations/puppet@production] Add temporary rsync modules to two Cassandra nodes

https://gerrit.wikimedia.org/r/721849

BTullis moved this task from Backlog to Complete on the Cassandra board.

Ding ding! Resolved an epic.