Below I have mentioned some of the issues that I have faced with Cassandra clusters of 24 and 36 nodes running Cassandra-0.8.2. We have these clusters hosted on Amazon EC2. Each Cassandra node runs Ubuntu Lucid with 16 GiB of RAM and 1 TB of EBS storage.
1) Some of the nodes are unreachable even though all Cassandra processes are running.
All the nodes in the cluster are up and running: they show "Up" status in the ring and the Cassandra process is running on every node. Still, when we do "describe cluster" it shows some of the nodes as unreachable.
· We figured out this issue during data loading with sstableloader. As these nodes are unreachable, sstableloader is not able to load the data. We have not been able to figure out the proper cause and solution for this, but it may be that these nodes stop responding because of heavy load or I/O. Once Cassandra is restarted on these nodes they become reachable again (a rough sketch of the checks and the restart is below).
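For reference, this is roughly how we verify the symptom and apply the restart workaround. The host addresses and the init script path are placeholders; how Cassandra is started varies between installations.

    # Check ring status and schema agreement from any node
    nodetool -h 10.0.0.1 ring
    cassandra-cli -h 10.0.0.1
    [default@unknown] describe cluster;

    # If a node shows up as UNREACHABLE in "describe cluster",
    # restarting Cassandra on it brought it back for us
    # (init script path is an assumption about the install)
    sudo /etc/init.d/cassandra restart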
2) Streaming gets stuck while increasing Cassandra cluster capacity.
We were unable to add nodes to an existing Cassandra cluster. When we were doubling the cluster capacity from 6 nodes to 12 nodes, some nodes stopped streaming after some time and were not able to make any progress.
· While increasing cluster capacity we introduced the new nodes with "auto_bootstrap" set to true in cassandra.yaml. This makes the new nodes stream data from the surrounding nodes. As a first step you can always stop this streaming, clean the data directories and restart the node; there is no data loss.
· You can check the logs for the cause of the issue. In our case it was a version mismatch error. We had upgraded Cassandra from 0.7.2 to 0.8.2, but the data for some column families, which we load directly from a Hadoop job, was not upgraded, and these column families were causing the problem. So we ran "nodetool scrub" on these column families for all 6 old nodes and then started the streaming. A sketch of both steps follows these notes.
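A rough sketch of both workarounds. The data directory paths and host names are assumptions from a default install, and the keyspace and column family names are placeholders for the ones loaded from the Hadoop job.

    # On a new node whose bootstrap streaming is stuck:
    # stop Cassandra, wipe its incomplete data, and start it again
    sudo /etc/init.d/cassandra stop
    rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/*
    sudo /etc/init.d/cassandra start

    # On each of the old nodes, rewrite the column families that still
    # contain pre-0.8 sstables before bootstrapping again
    nodetool -h <old-node> scrub MyKeyspace MyColumnFamily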
3) sstableloader gets stuck after loading some amount of data.
All nodes are up and running, with "Up" status in the ring and reachable in "describe cluster", but the loader is still not able to load data, or gets stuck after loading some amount of data.
· Recently I have been experiencing this issue, and the data load on all machines is up to 700 GB.
· I have not been able to figure out the cause of or a solution for this issue, and it seems very difficult to figure out which nodes are creating the problem. (The sketch below shows how we invoke the loader and watch the streams.)
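For context, this is roughly how we invoke the loader and watch stream progress; the directory layout (a directory named after the keyspace containing the sstables) and the host are placeholders.

    # sstableloader reads cassandra.yaml for cluster information and
    # streams the sstables found in a directory named after the keyspace
    bin/sstableloader /path/to/MyKeyspace

    # Watch active streams on the receiving nodes to see where it gets stuck
    nodetool -h 10.0.0.1 netstats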
4) nodetool cleanup corrupting sstables
After adding nodes to the cluster, they stream data from the surrounding nodes. To clean up the data that is no longer needed we need to run "nodetool cleanup" on those nodes, but we have experienced that after cleanup the sstables on those machines are corrupt.
· We noticed that there were always pending tasks after compaction (minor and major), so we checked the logs and they showed that the compaction manager was unable to compact the sstables.
· The only solution for this is to rebuild the sstables, so we ran scrub on all the nodes having corrupt sstables, and now compaction is running successfully (see the sketch after these notes). But this whole scrub process is very slow, and you should not load data into Cassandra while a scrub is in progress.
· More on this: https://issues.apache.org/jira/browse/CASSANDRA-3923
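A minimal sketch of how we check for stuck compactions and rebuild the affected sstables; the host, keyspace and column family names are placeholders.

    # Pending tasks that never drain show up here, together with
    # "unable to compact" errors in the Cassandra system log
    nodetool -h 10.0.0.1 compactionstats

    # Rebuild the corrupt sstables on each affected node
    nodetool -h 10.0.0.1 scrub MyKeyspace MyColumnFamily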
5) Compaction and scrub are not getting triggered because of "too many open files" errors
· We experienced this issue because the JNA jar was not in place; after adding this jar we were able to trigger scrub and compaction.
· As we were facing issues during compaction (major and minor), the number of sstables on the Cassandra machines went high, and even after increasing the open files limit we were not able to trigger compaction. We experienced this issue in Cassandra-0.8.2, and as per the Cassandra experts this is a bug in that version which is fixed from 0.8.10 onwards.
· We figured out a temporary solution for this: load the data on these nodes with minor compaction on. This reduces the number of sstables, and then you can trigger the major compaction or scrub. (The sketch below shows the limit and JNA changes we made.)
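These are roughly the changes we made; the limit value, the jar file name and the lib directory are assumptions based on a default 0.8 install, so adjust them to your environment.

    # Raise the open-files limit for the user running Cassandra
    # (lines added to /etc/security/limits.conf, then re-login and restart)
    cassandra  soft  nofile  65536
    cassandra  hard  nofile  65536

    # Drop the JNA jar into Cassandra's lib directory so it is on the classpath
    cp jna.jar /usr/share/cassandra/lib/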
6) Dead node always appears in describe cluster
If one of the nodes is dead and we cannot get a machine with the same IP address, it is not possible to get rid of that machine: it will always appear in "describe cluster" as an unreachable node and won't allow us to load sstables through sstableloader.
· This is the worst case and looks like a bug in Cassandra 0.8.x. As there is no proper way to remove this node, we were deleting the LocationInfo files from the system keyspace followed by a restart (sketched below). This worked for us, but it is very risky stuff.
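For completeness, this is roughly the risky workaround described above, assuming the default data directory layout; the paths are assumptions, and it should only be attempted with backups, since it wipes the node's saved view of the ring.

    # On the node that still remembers the dead host:
    sudo /etc/init.d/cassandra stop
    # Remove the persisted ring state (system keyspace, LocationInfo column family)
    rm /var/lib/cassandra/data/system/LocationInfo*
    sudo /etc/init.d/cassandra start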
7) Once we trigger a compaction, some of the nodes die with an OutOfMemory error.
From DataStax I got the following (a sketch of the row-cache suggestion is after the list):
If nodes are dying with OutOfMemory exceptions, there are four typical reasons for this:
· A row has grown too large
o A row cannot exceed 2GB, and must be compacted in memory. The RowWarningThresholdInMB directive will log which rows have exceeded the set threshold. Keep in mind that if you have had nodes down for a period of time, the row that may have grown too large could be due to HintedHandoff.
· Row cache is too large, or is caching large rows
o Row cache is generally a high-end optimization. Try disabling it and see if the OOM problems continue.
· Writes are using ConsistencyLevel.ZERO
o ConsistencyLevel.ZERO queues writes at the coordinator node and returns instantly. It should generally not be used.
· The memtable sizes are too large for the amount of heap allocated to the JVM
o Up to 3 memtables can be resident in memory per ColumnFamily, and adding another 1GB on top of that for Cassandra itself is a good estimate of total heap usage.
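One of the suggestions above is to disable the row cache. Assuming a 0.8-era schema managed through cassandra-cli, something along these lines should turn it off for a column family (host, keyspace and column family names are placeholders, and the attribute name may differ between versions):

    cassandra-cli -h 10.0.0.1
    [default@unknown] use MyKeyspace;
    [default@MyKeyspace] update column family MyColumnFamily with rows_cached = 0;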
In our case I just restarted Cassandra and tried to trigger the compaction again, and this time too the node died with an OOM error. So I restarted it again and ran scrub for the column family for which I was trying to trigger compaction. After the scrub, the compaction completed successfully (sketched below).
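A minimal sketch of that sequence, assuming the default init script and placeholder host, keyspace and column family names:

    # After the node comes back from the OOM, restart it, rebuild the
    # sstables of the column family that fails to compact, then compact
    sudo /etc/init.d/cassandra restart
    nodetool -h 10.0.0.1 scrub MyKeyspace MyColumnFamily
    nodetool -h 10.0.0.1 compact MyKeyspace MyColumnFamily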