
Cassandra table storage backend error in Deployment-prep
Closed, Resolved (Public)

Description

https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Cubarific is returning a 500 response with the error "Error in Cassandra table storage backend".

This makes it impossible for the Reading Web team to test the Page Previews feature on the beta cluster.

Event Timeline

The Cassandra cluster in deployment-prep is in a Bad Way(tm):

eevans@deployment-restbase01:~$ nodetool status -r
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                              Load      Tokens  Owns  Host ID                               Rack
UN  deployment-restbase01.deployment-prep.eqiad.wmflabs  1.28 GB   256     ?     190b30fe-66d5-4cd1-b517-a4b2e25a8760  rack1
?N  deployment-aqs01.deployment-prep.eqiad.wmflabs       52.57 MB  256     ?     25af9396-9f12-4d84-9b24-b9b2d9742974  rack1
?N  deployment-aqs03.deployment-prep.eqiad.wmflabs       51.92 MB  256     ?     ad17c86d-e842-4ada-b2c3-f8c3f8f7ac8d  rack1
UN  deployment-restbase02.deployment-prep.eqiad.wmflabs  1.29 GB   256     ?     0eff3596-b15e-4a3a-b48f-a5269d24cd03  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

It seems to have gotten crossed with the AQS cluster: both use the same cluster name (normally a guard against exactly this outcome), and the AQS nodes list the RESTBase nodes as seeds (which is how they found each other).
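For context, both of those knobs live in cassandra.yaml: a node will refuse to join a ring whose cluster_name differs from its own, and the seed list is where it looks for peers at startup. A hypothetical fragment (illustrative values only, not the actual deployment-prep configuration) showing how a shared name plus a cross-cluster seed lets two rings merge:

# /etc/cassandra/cassandra.yaml -- illustrative sketch; name and hostnames are assumptions
cluster_name: 'Test Cluster'    # identical on both clusters, so the guard never fired
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      # an AQS node pointing at a RESTBase seed is enough for the rings to merge
      - seeds: 'deployment-restbase01.deployment-prep.eqiad.wmflabs'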

Change 357515 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] WIP: throwing things against Puppet Compiler to see what sticks

https://gerrit.wikimedia.org/r/357515

Change 357516 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/restbase/deploy@master] Use cassandraDefaultConsistency: localOne for beta cluster

https://gerrit.wikimedia.org/r/357516

Change 357516 merged by Ppchelko:
[mediawiki/services/restbase/deploy@master] Use cassandraDefaultConsistency: localOne for beta cluster

https://gerrit.wikimedia.org/r/357516

The immediate issue was fixed by setting the default consistency to localOne for the beta cluster: with the clusters crossed, quorum operations fail whenever a replica lands on one of the unreachable AQS nodes, whereas localOne needs only a single live replica. The AQS and RESTBase clusters still need to be separated, though.
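A minimal sketch of the setting itself; the key name comes from the Gerrit change title above, while its placement in the beta config is assumed here (the real change lives in the restbase/deploy templates):

# Hypothetical excerpt from the beta RESTBase deploy config
cassandraDefaultConsistency: localOne   # presumably localQuorum before; localOne tolerates down replicas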

Change 357515 abandoned by Eevans:
WIP: throwing things against Puppet Compiler to see what sticks

Reason:
nevermind...

https://gerrit.wikimedia.org/r/357515

Eevans triaged this task as Medium priority. Jun 7 2017, 3:17 PM
Eevans added a subscriber: elukey.

Change 357646 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix cassandra's jmx_prometheus_javaagent jar path for deployment-prep

https://gerrit.wikimedia.org/r/357646

Change 357646 merged by Elukey:
[operations/puppet@production] Fix cassandra's jmx_prometheus_javaagent jar path for deployment-prep

https://gerrit.wikimedia.org/r/357646

Change 357649 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix cassandra's jmx_prometheus_javaagent config path for deployment-prep

https://gerrit.wikimedia.org/r/357649

Change 357649 merged by Elukey:
[operations/puppet@production] Fix cassandra's jmx_prometheus_javaagent config path for deployment-prep

https://gerrit.wikimedia.org/r/357649
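For reference, those two fixes concern how the Prometheus JMX exporter is attached to the Cassandra JVM: it is loaded via a -javaagent option that bundles the agent jar path, a port, and the exporter config path. Roughly like the following (the paths and port here are assumptions; the real values are managed by Puppet):

# Illustrative -javaagent wiring in the Cassandra environment file
JVM_OPTS="$JVM_OPTS -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=7800:/etc/cassandra/jmx_exporter.yaml"

If either path is wrong, the JVM fails to load the agent at startup, which is why both the jar path and the config path needed separate fixes.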

This is my fault: we did a big refactoring to use profile::cassandra in both AQS and RESTBase, but after merging I forgot to check deployment-prep. The main problem was that profile::cassandra::instances was picked up for both clusters from deployment-prep's common.yaml hiera config (which contains only the RESTBase nodes).

I have now fixed the Puppet config via Horizon, setting the correct instances for AQS via Puppet prefixes, and corrected some other problems that were waiting for somebody to restart Cassandra (see the code reviews above). Everything should now be set correctly.
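A sketch of the idea; the key name comes from the comments above, but the value shape and hostnames are assumptions here, since the real override lives in Horizon's per-prefix hiera:

# Hypothetical hiera override scoped to the deployment-aqs* Puppet prefix,
# so AQS no longer inherits the RESTBase node list from common.yaml
profile::cassandra::instances:
  "deployment-aqs01.deployment-prep.eqiad.wmflabs": {}
  "deployment-aqs03.deployment-prep.eqiad.wmflabs": {}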

Remaining issue: nodetool status still shows the mixed cluster on both the RESTBase and AQS nodes.

elukey renamed this task from "Cassandra table storage backend error" to "Cassandra table storage backend error in Deployment-prep". Jun 7 2017, 6:07 PM

> Remaining issue: nodetool status still shows the mixed cluster on both the RESTBase and AQS nodes.

Yeah, I think we're going to need to re-init both clusters to get this sorted; each cluster has a topology that includes the nodes from the other. We should also change the cluster names (cluster_name in cassandra.yaml) as an additional safeguard.

@elukey I held off on doing this to the RESTBase cluster today because it is in a sort of half-working, better-than-nothing state. I assume the same is true of AQS, and I didn't want to break things unexpectedly for you. We should sync up on this tomorrow.
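The re-init itself is straightforward; a rough sketch of the per-node procedure (service name and paths assume the stock Cassandra Debian packaging used here):

# Run on each node of the cluster being re-initialized
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/*   # drop data, commitlog, and saved caches
# Optionally set a new cluster_name in /etc/cassandra/cassandra.yaml first,
# so the rebuilt ring can never merge with its neighbour again.
sudo service cassandra start
nodetool status                    # confirm only the intended peers appear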

I have nuked /var/lib/cassandra/* on deployment-restbase0[12] and started the nodes back up. Things are clean on the RESTBase side now:

$ nodetool -h 10.68.17.189 status -r
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                              Load       Tokens       Owns (effective)  Host ID                               Rack
UN  deployment-restbase01.deployment-prep.eqiad.wmflabs  147.29 KB  256          100.0%            bef2fb6e-ea2c-4b84-a0b6-29c383340e6e  rack1
UN  deployment-restbase02.deployment-prep.eqiad.wmflabs  141.25 KB  256          100.0%            3cf7d335-b518-470d-a43f-cc32f693564f  rack1

And RESTBase there is back in business:

$ curl https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Cubarific | jq .
{
  "title": "Cubarific",
  "displaytitle": "Cubarific",
  "pageid": 184625,
  "extract_html": "",
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Hexahedron.jpg/287px-Hexahedron.jpg",
    "width": 287,
    "height": 320,
    "original": "https://upload.wikimedia.org/wikipedia/commons/7/78/Hexahedron.jpg"
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/7/78/Hexahedron.jpg",
    "width": 742,
    "height": 826
  },
  "lang": "en",
  "dir": "ltr",
  "timestamp": "2017-05-15T11:07:46Z"
}

Things are looking good from our side 👍

Resolved?

The AQS cluster was re-created from scratch, and a different Cassandra cluster name was applied via Puppet prefixes in Horizon to avoid a clash like this one in the future.
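A sketch of that safeguard; the exact hiera key and cluster name here are assumptions, the real values being whatever the deployment-aqs prefix in Horizon now carries:

# Hypothetical per-prefix hiera override giving AQS its own cluster name,
# so a node with stale cross-cluster seeds can no longer join the RESTBase ring
profile::cassandra::settings:
  cluster_name: "beta-aqs"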

Mentioned in SAL (#wikimedia-analytics) [2017-06-08T13:44:56Z] <elukey> AQS cluster in beta wiped and re-bootstrapped due to T167222

Eevans assigned this task to elukey.

With the Puppet configuration fixed and both clusters re-initialized, this should be complete.