
Cassandra table storage backend error in Deployment-prep
Closed, Resolved (Public)

Description

https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Cubarific is returning a 500 response with the error "Error in Cassandra table storage backend".

This makes it impossible for the Reading Web team to test the Page Previews feature on the beta cluster.

Event Timeline

The Cassandra cluster in deployment-prep is in a Bad Way(tm):

eevans@deployment-restbase01:~$ nodetool status -r
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                              Load      Tokens  Owns  Host ID                               Rack
UN  deployment-restbase01.deployment-prep.eqiad.wmflabs  1.28 GB   256     ?     190b30fe-66d5-4cd1-b517-a4b2e25a8760  rack1
?N  deployment-aqs01.deployment-prep.eqiad.wmflabs       52.57 MB  256     ?     25af9396-9f12-4d84-9b24-b9b2d9742974  rack1
?N  deployment-aqs03.deployment-prep.eqiad.wmflabs       51.92 MB  256     ?     ad17c86d-e842-4ada-b2c3-f8c3f8f7ac8d  rack1
UN  deployment-restbase02.deployment-prep.eqiad.wmflabs  1.29 GB   256     ?     0eff3596-b15e-4a3a-b48f-a5269d24cd03  rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless

It seems to have gotten crossed with the AQS cluster: both use the same cluster name (normally a guard against exactly this outcome), and the AQS nodes list the RESTBase nodes as seeds (which is how they found each other).
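For context, both of those knobs live in cassandra.yaml: a node will refuse to join a ring whose cluster_name differs from its own, and the seed list is where it looks for peers at startup. A hypothetical fragment (illustrative values only, not the actual deployment-prep configuration) showing how a shared name plus a cross-cluster seed lets two rings merge:

# /etc/cassandra/cassandra.yaml -- illustrative sketch; name and hostnames are assumptions
cluster_name: 'Test Cluster'    # identical on both clusters, so the guard never fired
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      # an AQS node pointing at a RESTBase seed is enough for the rings to merge
      - seeds: 'deployment-restbase01.deployment-prep.eqiad.wmflabs'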

Change 357515 had a related patch set uploaded (by Eevans; owner: Eevans):
[operations/puppet@production] WIP: throwing things against Puppet Compiler to see what sticks

https://gerrit.wikimedia.org/r/357515

Change 357516 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/services/restbase/deploy@master] Use cassandraDefaultConsistency: localOne for beta cluster

https://gerrit.wikimedia.org/r/357516

Change 357516 merged by Ppchelko:
[mediawiki/services/restbase/deploy@master] Use cassandraDefaultConsistency: localOne for beta cluster

https://gerrit.wikimedia.org/r/357516

The immediate issue was fixed by setting the default consistency to localOne for the beta cluster: with the clusters crossed, quorum operations fail whenever a replica lands on one of the unreachable AQS nodes, whereas localOne needs only a single live replica. The AQS and RESTBase clusters still need to be separated, though.
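A minimal sketch of the setting itself; the key name comes from the Gerrit change title above, while its placement in the beta config is assumed here (the real change lives in the restbase/deploy templates):

# Hypothetical excerpt from the beta RESTBase deploy config
cassandraDefaultConsistency: localOne   # presumably localQuorum before; localOne tolerates down replicas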

Change 357515 abandoned by Eevans:
WIP: throwing things against Puppet Compiler to see what sticks

Reason:
nevermind...

https://gerrit.wikimedia.org/r/357515

Eevans triaged this task as Medium priority. Jun 7 2017, 3:17 PM
Eevans added a subscriber: elukey.

Change 357646 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix cassandra's jmx_prometheus_javaagent jar path for deployment-prep

https://gerrit.wikimedia.org/r/357646

Change 357646 merged by Elukey:
[operations/puppet@production] Fix cassandra's jmx_prometheus_javaagent jar path for deployment-prep

https://gerrit.wikimedia.org/r/357646

Change 357649 had a related patch set uploaded (by Elukey; owner: Elukey):
[operations/puppet@production] Fix cassandra's jmx_prometheus_javaagent config path for deployment-prep

https://gerrit.wikimedia.org/r/357649

Change 357649 merged by Elukey:
[operations/puppet@production] Fix cassandra's jmx_prometheus_javaagent config path for deployment-prep

https://gerrit.wikimedia.org/r/357649
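For reference, those two fixes concern how the Prometheus JMX exporter is attached to the Cassandra JVM: it is loaded via a -javaagent option that bundles the agent jar path, a port, and the exporter config path. Roughly like the following (the paths and port here are assumptions; the real values are managed by Puppet):

# Illustrative -javaagent wiring in the Cassandra environment file
JVM_OPTS="$JVM_OPTS -javaagent:/usr/share/java/prometheus/jmx_prometheus_javaagent.jar=7800:/etc/cassandra/jmx_exporter.yaml"

If either path is wrong, the JVM fails to load the agent at startup, which is why both the jar path and the config path needed separate fixes.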

This is my fault: we did a big refactoring to use profile::cassandra in both AQS and RESTBase, but after merging I forgot to check deployment-prep. The main problem was that profile::cassandra::instances was picked up for both clusters from deployment-prep's common.yaml hiera config (which contains only the RESTBase nodes).

I have now fixed the Puppet config via Horizon, setting the correct instances for AQS via Puppet prefixes, and corrected some other problems that were waiting for somebody to restart Cassandra (see the code reviews above). Everything should now be set correctly.
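A sketch of the idea; the key name comes from the comments above, but the value shape and hostnames are assumptions here, since the real override lives in Horizon's per-prefix hiera:

# Hypothetical hiera override scoped to the deployment-aqs* Puppet prefix,
# so AQS no longer inherits the RESTBase node list from common.yaml
profile::cassandra::instances:
  "deployment-aqs01.deployment-prep.eqiad.wmflabs": {}
  "deployment-aqs03.deployment-prep.eqiad.wmflabs": {}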

Remaining issue: nodetool status still shows the mixed cluster on both the RESTBase and AQS nodes.

elukey renamed this task from "Cassandra table storage backend error" to "Cassandra table storage backend error in Deployment-prep". Jun 7 2017, 6:07 PM

> Remaining issue: nodetool status still shows the mixed cluster on both the RESTBase and AQS nodes.

Yeah, I think we're going to need to re-init both clusters to get this sorted; each cluster has a topology that includes the nodes from the other. We should also change the cluster names (cluster_name in cassandra.yaml) as an additional safeguard.

@elukey I held off on doing this to the RESTBase cluster today because it is in a sort of half-working, better-than-nothing state. I assume the same is true of AQS, and I didn't want to break things unexpectedly for you. We should sync up on this tomorrow.
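The re-init itself is straightforward; a rough sketch of the per-node procedure (service name and paths assume the stock Cassandra Debian packaging used here):

# Run on each node of the cluster being re-initialized
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/*   # drop data, commitlog, and saved caches
# Optionally set a new cluster_name in /etc/cassandra/cassandra.yaml first,
# so the rebuilt ring can never merge with its neighbour again.
sudo service cassandra start
nodetool status                    # confirm only the intended peers appear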

I have nuked /var/lib/cassandra/* on deployment-restbase0[12] and started the nodes back up. Things are clean on the RESTBase side now:

$ nodetool -h 10.68.17.189 status -r
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address                                              Load       Tokens       Owns (effective)  Host ID                               Rack
UN  deployment-restbase01.deployment-prep.eqiad.wmflabs  147.29 KB  256          100.0%            bef2fb6e-ea2c-4b84-a0b6-29c383340e6e  rack1
UN  deployment-restbase02.deployment-prep.eqiad.wmflabs  141.25 KB  256          100.0%            3cf7d335-b518-470d-a43f-cc32f693564f  rack1

And RESTBase there is back in business:

$ curl https://en.wikipedia.beta.wmflabs.org/api/rest_v1/page/summary/Cubarific | jq .
{
  "title": "Cubarific",
  "displaytitle": "Cubarific",
  "pageid": 184625,
  "extract_html": "",
  "thumbnail": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Hexahedron.jpg/287px-Hexahedron.jpg",
    "width": 287,
    "height": 320,
    "original": "https://upload.wikimedia.org/wikipedia/commons/7/78/Hexahedron.jpg"
  },
  "originalimage": {
    "source": "https://upload.wikimedia.org/wikipedia/commons/7/78/Hexahedron.jpg",
    "width": 742,
    "height": 826
  },
  "lang": "en",
  "dir": "ltr",
  "timestamp": "2017-05-15T11:07:46Z"
}

Things are looking good from our side 👍

Resolved?

The AQS cluster was re-created from scratch, and a different Cassandra cluster name was applied via Puppet prefixes in Horizon to avoid a clash like this one in the future.
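A sketch of that safeguard; the exact hiera key and cluster name here are assumptions, the real values being whatever the deployment-aqs prefix in Horizon now carries:

# Hypothetical per-prefix hiera override giving AQS its own cluster name,
# so a node with stale cross-cluster seeds can no longer join the RESTBase ring
profile::cassandra::settings:
  cluster_name: "beta-aqs"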

Mentioned in SAL (#wikimedia-analytics) [2017-06-08T13:44:56Z] <elukey> AQS cluster in beta wiped and re-bootstrapped due to T167222

Eevans assigned this task to elukey.

With the Puppet configuration fixed and both clusters re-initialized, this should be complete.