Recently I was working on improving the read performance of Cassandra.
While working on it I came across some facts and findings, so I thought
I would share them so that we can explore them together.
I have been using Cassandra for over a year, and since writes are
sequential they are extremely fast; I have seen very fast loads
with sstableloader and BulkOutputFormat. But read performance is a real concern. It can be tuned with various
parameters and also depends on the way you model the data (the database
schema). If you design the schema with an understanding of how Cassandra works, reads can be fast too.
When a read request for a
row comes in to a node, the row must be combined from all SSTables on that node
that contain columns from the row in question, as well as from any unflushed
memtables, to produce the requested data. To optimize this piecing-together
process, Cassandra uses an in-memory structure called a Bloom
filter. Each SSTable has a Bloom filter associated with it that checks
if any data for the requested row exists in the SSTable before doing any disk I/O.
As a result, Cassandra is very performant on reads when compared to other
storage systems, even for read-heavy workloads. As with any database, reads are
fastest when the most in-demand data (or hot working set) fits into
memory. Although all modern storage
systems rely on some form of caching to allow for fast access to hot data, not
all of them degrade gracefully when the cache capacity is exceeded and disk I/O
is required. Cassandra's read performance benefits from built-in caching, but it
also does not dip dramatically when random disk seeks are required. When I/O
activity starts to increase in Cassandra due to increased read load, it is easy
to remedy by adding more nodes to the cluster.
For rows that are accessed
frequently, Cassandra has a built-in key cache (and an optional row cache);
both are covered in the Caching section below.
The Cassandra read flow can be
summarized as follows. For each key requested, it will:
- Validate the request (for example, that the column family exists).
- Get all the endpoints where the key exists.
- Select the nearest, best-performing endpoint according to the endpoint snitch.
- Check whether the key is in the key cache (the key cache stores the primary index entry for the key) and, on a hit, read the data from the corresponding SSTable.
- If the key is not cached, check for every SSTable whether the key exists, with the help of its Bloom filter.
- If the Bloom filter says the key does not exist, move on to the next SSTable.
- If the Bloom filter says the key exists, access the primary index to locate the key's position in the SSTable, then read the requested data from the SSTable. If the key turns out not to be present in the index (a false positive), move on to the next SSTable. (Please correct me if I am missing something here.)
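The per-key lookup loop above can be sketched in Python. This is a simplified model, not Cassandra's actual internals: each SSTable is represented as a dict, and a plain set stands in for the Bloom filter (a real Bloom filter can also return false positives, which is why the index check can still miss).

```python
def read_key(key, memtable, sstables):
    """Sketch of Cassandra's per-key read path on a single replica.

    Hypothetical model: each SSTable is a dict with 'bloom' (a set standing
    in for a real Bloom filter), 'index' (key -> position), and 'data'
    (key -> row fragment).
    """
    fragments = []
    if key in memtable:                      # unflushed writes live in the memtable
        fragments.append(memtable[key])
    for sst in sstables:
        if key not in sst["bloom"]:          # Bloom filter: key definitely absent
            continue
        if key not in sst["index"]:          # false positive: consult next SSTable
            continue
        fragments.append(sst["data"][key])   # seek to the indexed position and read
    return fragments                         # fragments are merged; newest columns win

memtable = {"k1": {"c1": "new"}}
sstables = [
    {"bloom": {"k1", "k2"},
     "index": {"k1": 0, "k2": 42},
     "data": {"k1": {"c2": "old"}, "k2": {"c1": "x"}}},
]
print(read_key("k1", memtable, sstables))   # fragments from both memtable and SSTable
```

The real read also merges the fragments by column timestamp; here the merge is left out to keep the control flow visible.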
Factors affecting Cassandra Read Performance:
The following factors affect Cassandra read performance:
Index Interval
Bloom Filter False Positive
Consistency Level
Read Repair Chance
Caching
Compaction
Data modeling
Cluster Deployment
Index Interval
(Default:
128)
Each SSTable has an index file containing row keys and the position at which
that row starts in the data file. At startup, Cassandra reads a sample of that
index into memory. By default 1 row key out of every 128 is sampled.
To
find a row, Cassandra performs a binary search on the sample, then does just
one disk read of the index block corresponding to the closest sampled entry.
The larger the sampling, the more effective the index is (at the cost of memory
usage). A smaller value for this property results in a larger, more effective
index. Generally, a value between 128 and 512 in combination with a large
column family key cache offers the best trade off between memory usage and performance.
You may want to increase the sampling interval (index_interval) if you have
many small rows, thus decreasing the index size and memory usage. For large
rows, decreasing the interval may improve read performance.
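The sampled-index lookup described above can be modeled in a few lines of Python. This is illustrative only: `bisect` stands in for Cassandra's binary search over the in-memory sample, and the key names are made up.

```python
import bisect

def build_sample(sorted_keys, index_interval=128):
    """Keep 1 of every index_interval row keys in memory, like Cassandra's index sample."""
    return sorted_keys[::index_interval]

def locate_index_block(sample, key):
    """Binary-search the sample for the closest preceding entry; the on-disk
    index block starting at that entry is then fetched with a single disk read."""
    i = bisect.bisect_right(sample, key) - 1
    return sample[max(i, 0)]

keys = [f"key{n:05d}" for n in range(1000)]    # 1000 sorted row keys
sample = build_sample(keys, index_interval=128)
print(len(sample))                             # 8 sampled entries instead of 1000
print(locate_index_block(sample, "key00300"))  # closest sampled entry: key00256
```

A smaller interval means a larger sample (more memory) but a shorter scan within each index block, which is exactly the trade-off the text describes.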
Bloom Filter False Positive
A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. False positives are possible, but false negatives are not. Cassandra uses Bloom filters to save I/O when performing a key lookup: each SSTable has a Bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free.
Bloom filters are surprisingly simple: divide a memory area into buckets (one bit per bucket for a standard Bloom filter; more, typically four, for a counting Bloom filter). To insert a key, generate several hashes per key and mark the buckets for each hash. To check if a key is present, check each of its buckets; if any bucket is empty, the key was never inserted into the filter. If all buckets are non-empty, though, the key is only probably inserted: other keys' hashes could have covered the same buckets. See "All you ever wanted to know about writing bloom filters" for details, and in particular why getting a really good output distribution is important.
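The bucket scheme just described can be sketched as a standard one-bit-per-bucket Bloom filter. The hash construction here (double hashing over a SHA-256 digest) is illustrative; it is not the hashing Cassandra actually uses.

```python
import hashlib

class BloomFilter:
    """Minimal standard Bloom filter: one bit per bucket (illustrative sketch)."""

    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0                                  # bit i == bucket i

    def _buckets(self, key):
        # Derive several bucket positions per key via double hashing
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for b in self._buckets(key):
            self.bits |= 1 << b                        # mark each bucket

    def might_contain(self, key):
        # Any empty bucket => key was definitely never inserted (no false negatives)
        return all(self.bits & (1 << b) for b in self._buckets(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))    # True: all of its buckets are set
print(bf.might_contain("row-99"))    # probably False; false positives are possible
```

An empty filter always answers False, which is what makes queries for keys that don't exist "almost free": no index lookup or disk seek ever happens.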
bloom_filter_fp_chance
(Default: ~0.0007) Desired false-positive probability for SSTable Bloom filters. Valid values are 0.0001 to 1.0. At 1.0 the Bloom filter is effectively disabled; this is reasonable if you are using LeveledCompactionStrategy and not querying for non-existent rows. If you have many keys per node and are worried about Bloom filter memory usage, a reasonable first step is to try 0.01; if that is still too large, then try 0.1.
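The memory cost of a given false-positive chance follows the standard Bloom filter sizing formula, m/n = -ln(p) / (ln 2)^2 bits per key. This is textbook Bloom filter math, not Cassandra's exact internal accounting, but it shows why raising bloom_filter_fp_chance shrinks the filters:

```python
import math

def bloom_bits_per_key(fp_chance):
    """Optimal bits per key for a standard Bloom filter: m/n = -ln(p) / (ln 2)^2."""
    return -math.log(fp_chance) / (math.log(2) ** 2)

for p in (0.0007, 0.01, 0.1):
    print(f"fp_chance={p}: ~{bloom_bits_per_key(p):.1f} bits per key")
```

Moving from the default (~0.0007, roughly 15 bits per key) to 0.01 (under 10 bits) or 0.1 (under 5 bits) cuts Bloom filter memory substantially, at the cost of more wasted disk seeks on misses.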
Consistency Level
Consistency levels in Cassandra can be set on any read or
write query. This allows application developers to tune consistency on a
per-query basis depending on their requirements for response time versus data
accuracy. Cassandra offers a number of consistency levels for both reads and
writes. When you do a read in Cassandra, the consistency level
specifies how many replicas must respond before a result is returned to the
client application. Cassandra checks the specified number of replicas for the
most recent data to satisfy the read request (based on the timestamp).
The available consistency levels range from ONE (the lowest
consistency but highest availability) to ALL (the highest consistency but
lowest availability). QUORUM is a good middle ground, ensuring strong
consistency yet still tolerating some level of failure. A quorum is
calculated as (rounded down to a whole number):
(replication_factor / 2) + 1
For example, with a replication factor of 3, a quorum is
2 (can tolerate 1 replica down). With a replication factor of 6, a quorum is 4
(can tolerate 2 replicas down). Choosing a consistency level for reads and writes
involves determining your requirements for consistent results (always reading
the most recently written data) versus read or write latency (the time it takes
for the requested data to be returned or for the write to succeed).
If latency is a top priority, consider a consistency
level of ONE (only one replica node must successfully respond to the read or
write request). There is a higher probability of stale data being read with
this consistency level, as the replicas contacted for reads may not always have the most recent write. For some
applications, this may be an acceptable trade-off. If it is an absolute
requirement that a write never fail, you may also consider a write consistency
level of ANY. This level has the highest probability of a
read not returning the latest written values (see hinted handoff).
If consistency is the top priority, you can ensure that a
read will always reflect the most recent write by using the following formula:
(nodes_written + nodes_read) > replication_factor
For example, if your application is using the QUORUM
consistency level for both write and read operations and you are using a
replication factor of 3, then this ensures that 2 nodes are always written and
2 nodes are always read. The combination of nodes written and read (4) being
greater than the replication factor (3) ensures strong read consistency.
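Both formulas above are easy to check numerically. A small Python sketch (plain arithmetic, not Cassandra code):

```python
def quorum(replication_factor):
    """A quorum is (replication_factor / 2) + 1, rounded down to a whole number."""
    return replication_factor // 2 + 1

def is_strongly_consistent(nodes_written, nodes_read, replication_factor):
    """Reads always see the latest write when the written and read replica
    sets are forced to overlap: nodes_written + nodes_read > replication_factor."""
    return nodes_written + nodes_read > replication_factor

rf = 3
w = r = quorum(rf)                        # QUORUM reads and writes at RF=3
print(w, r)                               # 2 2
print(is_strongly_consistent(w, r, rf))   # True: 2 + 2 > 3
print(quorum(6))                          # 4 (tolerates 2 replicas down)
```

Note that ONE/ONE at RF=3 fails the check (1 + 1 is not greater than 3), which is exactly why that combination can return stale data.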
Read Repair Chance
Read repair means that when a query
is made against a given key, we perform a digest query against
all the replicas of the key and push the most recent version to any out-of-date
replicas. If a lower ConsistencyLevel than ALL was specified, this is done in
the background after returning the data from the closest replica to the client;
otherwise, it is done before returning the data.
This means that in almost all cases,
at most the first instance of a query will return old data.
Range scans are not per-key and do
not do read repair. (A range scan at CL > ONE *will* reconcile differences
in the replicas required to achieve the given CL, but extra replicas are not
compared in the background.)
read_repair_chance
(Default: 0.1) Specifies the probability with
which read repairs should be invoked on non-quorum reads. The value must be between 0 and 1. A value of 0.1 means that a
read repair is performed 10% of the time and a value of 1 means that a read repair is performed 100% of the time. Lower
values improve read throughput, but increase the chances of stale values when not using a strong consistency level.
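The probabilistic trigger can be sketched in a few lines of Python (an illustrative model of the behavior, not Cassandra's implementation):

```python
import random

def should_read_repair(read_repair_chance, rng=random.random):
    """On a non-quorum read, trigger a background read repair with this probability."""
    return rng() < read_repair_chance

# With read_repair_chance = 0.1, roughly 10% of reads trigger a repair:
rng = random.Random(42)                   # fixed seed so the sketch is repeatable
repairs = sum(should_read_repair(0.1, rng.random) for _ in range(10_000))
print(repairs)                            # close to 1000 out of 10000 reads
```

Every repaired read costs extra digest queries to the other replicas, which is why lowering this value improves raw read throughput.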
Caching
Tuning Data Caches
These caches are built into Cassandra and provide very efficient
data caching:
• Key cache: a cache of the primary key index for a
Cassandra table. Enabled by default.
• Row cache: similar to a traditional cache like memcached. Holds
the entire row in memory so reads can be satisfied without using disk. Disabled
by default.
If read performance is critical, you can leverage the built-in
caching to effectively pry dedicated caching tools, such as memcached,
completely out of the stack. Such deployments remove a redundant layer and
strengthen cache functionality in the lower tier where the data is already
being stored. Thanks to periodic cache saving, a restarted node does not have
to begin from a completely cold cache.
With proper tuning, key cache hit rates of 85% or better are possible with Cassandra, and each hit on a key cache can save one disk seek per SSTable. Row caching, when feasible, can save the system from performing any disk seeks at all when fetching a cached row. When growth in the read load begins to impact your hit rates, you can add capacity to restore optimal levels of caching. Typically, expect a 90% hit rate for row caches. If row cache hit rates are 30% or lower, it may make more sense to leave row caching disabled (the default). Using only the key cache makes the row cache available for other column families that need it.
How Caching Works
When both row and key caches are configured, the row cache returns
results whenever possible. In the event of a row cache miss, the key cache
might still provide a hit that makes the disk seek much more efficient.
Consider two read operations on a column family with both caches already
populated.
One read operation hits the row cache, returning the requested row
without a disk seek. The other read operation requests a row that is not
present in the row cache but is present in the key cache. After accessing the
row in the SSTable, the system returns the data and populates the row cache
with this read operation.
When to Use Key Caching
Because the key cache holds the location of keys in memory on a
per-column family basis, turning this value up can have an immediate, positive
impact on column family reads as soon as the cache warms.
High levels of key caching are recommended for most scenarios.
Cases for row caching are more specialized,
but whenever it can coexist peacefully with other demands on memory resources,
row caching provides the most dramatic gains in efficiency.
Using the default key cache setting, or a higher one, works well
in most cases. Tune key cache sizes in conjunction with the Java
heap size.
When to Use Row Caching
Row caching is recommended in these cases:
• Data access patterns follow a normal (Gaussian) distribution.
• Rows contain heavily-read data and queries frequently return
data from most or all of the columns.
General Cache Usage Tips
Some tips for efficient cache use are:
• Store lower-demand data or data with extremely long rows in a
column family with minimal or no caching.
• Deploy a large number of Cassandra nodes under a relatively
light load per node.
• Logically separate heavily-read data into discrete column
families.
Cassandra's memtables have overhead for index structures on top of
the actual data they store. If the size of the values stored in the
heavily-read columns is small compared to the number of columns and rows
themselves (long, narrow rows), this overhead can be substantial. Short, narrow
rows, on the other hand, lend themselves to highly efficient row caching.
Enabling the Key and Row Caches
Enable the key and row caches at the column family level using the
CQL caching parameter. Unlike earlier Cassandra versions,
cache sizes do not need to be specified per table. Just set caching to all,
keys_only, rows_only, or none, and Cassandra weighs the cached data by size
and access frequency and thus makes optimal use of the cache memory without
manual tuning. For archived tables, disable caching entirely
because these tables are read infrequently.
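For example, with the CQL caching property (table names here are hypothetical, and the exact syntax varies between Cassandra versions):

```sql
-- Hot, frequently read table: cache both keys and rows
CREATE TABLE users (
  user_id text PRIMARY KEY,
  name    text
) WITH caching = 'all';

-- Typical table: key caching only (the default)
ALTER TABLE orders WITH caching = 'keys_only';

-- Archived table, read infrequently: disable caching entirely
ALTER TABLE events_archive WITH caching = 'none';
```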
Setting Cache Options
In the cassandra.yaml file, tune caching by changing
these options:
• key_cache_size_in_mb: The capacity in megabytes of
all key caches on the node.
• row_cache_size_in_mb: The capacity in megabytes of
all row caches on the node.
• key_cache_save_period: How often to save the key
caches to disk.
• row_cache_save_period: How often to save the row
caches to disk.
• row_cache_provider: The implementation used for row
caches on the node.
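These options live in cassandra.yaml; a sketch with values that were typical defaults in this era of Cassandra (illustrative, not tuning recommendations):

```yaml
key_cache_size_in_mb:            # empty means automatic sizing
row_cache_size_in_mb: 0          # 0 disables the row cache (the default)
key_cache_save_period: 14400     # seconds between saves of the key cache to disk
row_cache_save_period: 0         # 0 disables saving the row cache to disk
row_cache_provider: SerializingCacheProvider   # off-heap provider; requires JNA
```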
Monitoring Cache Tune Ups
Make changes to cache options in small, incremental adjustments,
then monitor the effects of each change using one of the following tools:
• nodetool cfstats
• JConsole
About the Off-Heap Row Cache
Cassandra can store cached rows in native memory, outside the Java
heap. This results in both a smaller per-row memory footprint and reduced JVM
heap requirements, which helps keep the heap size in the sweet spot for JVM garbage
collection performance.
Using
the off-heap row cache requires the JNA library to be installed; otherwise,
Cassandra falls back on the on-heap cache provider.
Compaction
In
the background, Cassandra periodically merges SSTables together into larger
SSTables using a process called compaction. Compaction merges row
fragments together, removes expired tombstones (deleted columns), and rebuilds primary
and secondary indexes. Since the SSTables are sorted by row key, this merge is
efficient (no random disk I/O). Once a newly merged SSTable is complete, the
input SSTables are marked as obsolete and eventually deleted by the JVM garbage
collection (GC) process. However, during compaction, there is a temporary spike
in disk space usage and disk I/O.
Compaction impacts
read performance in two ways. While a compaction is in progress, it temporarily
increases disk I/O and disk utilization which can impact read performance for
reads that are not fulfilled by the cache. However, after a compaction has been
completed, off-cache read performance improves since there are fewer SSTable
files on disk that need to be checked in order to complete a read request.
Data modeling
Planning a data model in Cassandra involves different design
considerations than you may be used to if you work with relational databases.
Ultimately, the data model you design depends on the data you want to capture
and how you plan to access it. However, there are some common design
considerations for Cassandra data model planning.
Start with Queries
The best way to approach data modeling for Cassandra is to start
with your queries and work back from there. Think about the actions your
application needs to perform, how you want to access the data, and then design
column families to support those access patterns. A good rule of a thumb is one
column family per query since you optimize column families for read
performance.
For example, start with listing the use cases your application
needs to support. Think about the data you want to capture and the lookups your
application needs to do. Also note any ordering, filtering or grouping
requirements. For example, needing events in chronological order or considering
only the last 6 months worth of data would be factors in your data model
design.
Denormalize to Optimize
In the relational world, the data model is usually designed up front
with the goal of normalizing the data to minimize redundancy. Normalization
typically involves creating smaller, well-structured tables and then defining
relationships between them. During queries, related tables are joined to
satisfy the request.
Cassandra
does not have foreign key relationships like a relational database does, which
means you cannot join multiple column families to satisfy a given query
request. Cassandra performs best when the data needed to satisfy a given query
is located in the same column family. Try to plan your data model so
that one or more rows in a single column family are used to answer each query.
This sacrifices disk space (one of the cheapest resources for a server) in order
to reduce the number of disk seeks and the amount of network traffic.
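A sketch of "one column family per query" with a hypothetical schema: instead of joining users and comments, the same comment data is stored twice, once per access pattern, so each query reads from a single column family.

```sql
-- Query 1: comments on a given item, in chronological order
CREATE TABLE comments_by_item (
  item_id    text,
  created_at timestamp,
  author     text,
  body       text,
  PRIMARY KEY (item_id, created_at)
);

-- Query 2: comments written by a given user (same data, denormalized)
CREATE TABLE comments_by_user (
  author     text,
  created_at timestamp,
  item_id    text,
  body       text,
  PRIMARY KEY (author, created_at)
);
```

The write path inserts into both tables; the extra disk space buys single-seek reads for each query.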
Cluster Deployment
When planning a Cassandra cluster deployment, you should have a good idea of the initial volume of data you plan to store and a good estimate of your typical application workload.
Selecting Hardware for Enterprise Implementations
As with any application, choosing appropriate hardware depends on selecting the right balance of the following resources: memory, CPU, disks, number of nodes, and network.
Memory
The more memory a Cassandra node has, the better read performance. More RAM allows for larger cache sizes and reduces disk I/O for reads. More RAM also allows memory tables (memtables) to hold more recently written data. Larger memtables lead to a fewer number of SSTables being flushed to disk and fewer files to scan during a read. The ideal amount of RAM depends on the anticipated size of your hot data.
• For dedicated hardware, a minimum of 8GB of RAM is needed; DataStax recommends 16GB - 32GB.
• Java heap space should be set to a maximum of 8GB or half of your total RAM, whichever is lower. (A greater heap size has more intense garbage collection periods.)
• For a virtual environment use a minimum of 4GB, such as Amazon EC2 Large instances. For production clusters with a healthy amount of traffic, 8GB is more common.
CPU
Insert-heavy workloads are CPU-bound in Cassandra before becoming memory-bound. Cassandra is highly concurrent and uses as many CPU cores as available.
• For dedicated hardware, 8-core processors are the current price-performance sweet spot.
• For virtual environments, consider using a provider that allows CPU bursting, such as Rackspace Cloud Servers.
Disk
What you need for your environment depends a lot on the usage, so it's important to understand the mechanism.
Cassandra writes data to disk for two purposes:
• All data is written to the commit log for durability.
• When thresholds are reached, Cassandra periodically flushes in-memory data structures (memtables) to SSTable data files for persistent storage of column family data.
Commit logs receive every write made to a Cassandra node, but are only read during node start up. Commit logs are purged after the corresponding data is flushed. Conversely, SSTable (data file) writes occur asynchronously and are read during client look-ups. Additionally, SSTables are periodically compacted. Compaction improves performance by merging and rewriting data and discarding old data. However, during compaction (or node repair), disk utilization and data directory volume can substantially increase. For this reason, DataStax recommends leaving an adequate amount of free disk space available on a node (50% [worst case] for tiered compaction, 10% for leveled compaction).
Recommendations:
• When choosing disks, consider both capacity (how much data you plan to store) and I/O (the write/read throughput rate). Most workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with more RAM).
• Solid-state drives (SSDs) are also a valid alternative for Cassandra. Cassandra's sequential, streaming write patterns minimize the undesirable effects of write amplification associated with SSDs.
• Ideally Cassandra needs at least two disks, one for the commit log and the other for the data directories. At a minimum the commit log should be on its own partition.
• Commit log disk - this disk does not need to be large, but it should be fast enough to receive all of your writes as appends (sequential I/O).
• Data disks - use one or more disks and make sure they are large enough for the data volume and fast enough to both satisfy reads that are not cached in memory and to keep up with compaction.
• RAID - compaction can temporarily require up to 100% of the free in-use disk space on a single data directory volume. This means when approaching 50% of disk capacity, you should use RAID 0 or RAID 10 for your data directory volumes. RAID also helps smooth out I/O hotspots within a single SSTable.
• Use RAID0 if disk capacity is a bottleneck and rely on Cassandra's replication capabilities for disk failure tolerance. If you lose a disk on a node, you can recover lost data through Cassandra's built-in repair.
• Use RAID10 to avoid large repair operations after a single disk failure, or if you have disk capacity to spare.
• Because data is stored in the memtable, generally RAID is not needed for the commit log disk, but if you need the extra redundancy, use RAID 1.
• Extended file systems - On ext2 or ext3, the maximum file size is 2TB even using a 64-bit kernel. On ext4 it is 16TB. Because Cassandra can use almost half your disk space for a single file, use XFS when raiding large disks together, particularly if using a 32-bit kernel. XFS file size limits are 16TB max on a 32-bit kernel, and essentially unlimited on 64-bit.
Number of Nodes
The amount of data on each disk in the array isn't as important as the total size per node. Using a greater number of smaller nodes is better than using fewer larger nodes because of potential bottlenecks on larger nodes during compaction.
Network
Since Cassandra is a distributed data store, it puts load on the network to handle read/write requests and replication of data across nodes. Be sure to choose reliable, redundant network interfaces and make sure that your network can handle traffic between nodes without bottlenecks.
• Recommended bandwidth is 1000 Mbit/s (gigabit) or greater.
• Bind the intra-node communication interface (listen_address) to a specific NIC (Network Interface Card).
• Bind the RPC server interface (rpc_address) to another NIC.
Cassandra is efficient at routing requests to replicas that are geographically closest to the coordinator node handling the request. Cassandra will pick a replica in the same rack if possible, and will choose replicas located in the same data center over replicas in a remote data center.
Firewall
If using a firewall, make sure that nodes within a cluster can reach each other on the required ports (storage, RPC, and JMX).