This document might not be correct.
What is Cassandra?
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.
…
Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Goal
- Understand concepts of Cassandra
- Understand how Cassandra reads, writes, and deletes data
- Understand our Cassandra configuration
Words
Node
→ Where you store your data. It is the basic infrastructure component of Cassandra.
- Coordinator node:
When a request is sent to any Cassandra node, that node acts as a proxy between the application (actually, the Cassandra driver) and the nodes involved in the request flow. This proxy node is called the coordinator. (ref. https://medium.com/jorgeacetozi/cassandra-architecture-and-write-path-anatomy-51e339bcfe0c)
Datacenter
→ A collection of related nodes. A datacenter can be a physical datacenter or virtual datacenter.
Cluster
→ A cluster contains one or more datacenters. It can span physical locations.
Cassandra Data model
- Columns - name-value pair
- Row - container for columns referenced by primary key
- Table - container for rows
- Keyspace - container for tables that span one or more nodes (It is like a database in relational database system)
“Keys” in Cassandra
- Primary key = partition key + clustering key
- Partition key decides how data is distributed across nodes
- Clustering key decides how data is sorted within a partition on a single node
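The split above can be sketched in plain Python. This is a toy illustration, not the real storage engine; the table, column names, and data are invented, modeling a hypothetical `PRIMARY KEY ((user_id), event_time)`.

```python
# Toy sketch: partition key groups rows, clustering key sorts within a group.
# All names and data here are invented for illustration.
from collections import defaultdict

# Rows for a hypothetical table with PRIMARY KEY ((user_id), event_time)
rows = [
    {"user_id": "alice", "event_time": 3, "action": "click"},
    {"user_id": "bob",   "event_time": 1, "action": "view"},
    {"user_id": "alice", "event_time": 1, "action": "login"},
]

# The partition key (user_id) decides which partition a row belongs to;
# within each partition, rows are kept sorted by the clustering key.
partitions = defaultdict(list)
for row in rows:
    partitions[row["user_id"]].append(row)
for part in partitions.values():
    part.sort(key=lambda r: r["event_time"])

print([r["event_time"] for r in partitions["alice"]])  # [1, 3]
```

Note how all of alice's rows end up in one partition, already ordered by `event_time`; this is why range queries on the clustering key are cheap.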
How is data distributed?
- A hashing function called partitioner is used to generate token values
- partition keys are used to generate the token
- Every node is assigned a unique token range that determines which rows go to which node.
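The three bullets above can be sketched as follows. This is a simplified illustration, assuming a toy MD5-based hash in place of Cassandra's Murmur3 partitioner, and a simple modulo instead of real token ranges; the node names are invented.

```python
# Minimal sketch of token-based distribution (toy hash, invented node names).
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def token(partition_key: str) -> int:
    """The 'partitioner': hash the partition key to a token."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16)

def owner(partition_key: str) -> str:
    """Map a token to a node (real Cassandra assigns token *ranges*)."""
    return NODES[token(partition_key) % len(NODES)]
```

Because the token depends only on the partition key, the same key is always routed to the same node, which is what makes reads and writes locatable without a central directory.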
Cassandra Architecture
Replication
Replication is necessary for high availability. We can decide how many nodes hold the same data (the replication factor) and how the data is placed (the replication strategy).
About replication strategy and factor
Note: With the gossip protocol and the snitch, each node knows which nodes hold replicas of the same data
Write consistency
Consistency in Cassandra is tunable, meaning we can choose how many replicas must acknowledge each operation.
- Increase or decrease consistency
- Trade-off between consistency & performance
- Configure consistency for read & write separately
We can set
- ONE
→ A write must be written to the commit log and memtable of at least one replica node.
- ALL
→ A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
- QUORUM
→ A write must be written to the commit log and memtable on a quorum of replica nodes across all datacenters.
- LOCAL_QUORUM
→ A write must be written to the commit log and memtable on a quorum of replica nodes in the same datacenter as the coordinator.
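The quorum arithmetic behind QUORUM/LOCAL_QUORUM is small enough to show directly. The formula quorum = floor(RF / 2) + 1 matches Cassandra's definition; the R + W > RF check is the usual rule of thumb for when reads are guaranteed to see the latest write.

```python
# Quorum arithmetic for tunable consistency.

def quorum(replication_factor: int) -> int:
    # A quorum is a majority of replicas.
    return replication_factor // 2 + 1

def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    # Reads and writes overlap in at least one replica when R + W > RF.
    return read_replicas + write_replicas > replication_factor

rf = 3
print(quorum(rf))                                          # 2
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))  # True
print(is_strongly_consistent(1, 1, rf))                    # False
```

This is the trade-off the bullets above describe: QUORUM reads plus QUORUM writes give strong consistency, while ONE/ONE is faster but may read stale data.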
Read consistency
Basically, it is similar to the write consistency. You can refer to the details in https://docs.datastax.com/en/dse/5.1/dse-arch/datastax_enterprise/dbInternals/dbIntConfigConsistency.html#Readconsistencylevels.
Storage
CommitLogs
Commitlogs are an append-only log of all mutations local to a Cassandra node. Any data written to Cassandra will first be written to a commit log before being written to a memtable. This provides durability in the case of unexpected shutdown. On startup, any mutations in the commit log will be applied to memtables.
- Any data written will first be written to a commit log
Memtables
Memtables are in-memory structures where Cassandra buffers writes. In general, there is one active memtable per table. Eventually, memtables are flushed onto disk and become immutable SSTables.
This can be triggered in several ways:
- The memory usage of the memtables exceeds the configured threshold (see memtable_cleanup_threshold)
- The CommitLog approaches its maximum size, and forces memtable flushes in order to allow commitlog segments to be freed
- In-memory structures where Cassandra buffers writes
- On startup, memtables are rebuilt by replaying the commitlog
- Eventually, memtables are flushed onto disk and become immutable SSTables
SSTable
SSTables are the immutable data files that Cassandra uses for persisting data on disk.
As SSTables are flushed to disk from Memtables or are streamed from other nodes, Cassandra triggers compactions which combine multiple SSTables into one. Once the new SSTable has been written, the old SSTables can be removed.
- SSTables are immutable data files and they are used for persisting data on disk
What is the difference between commitlog and sstable?
The commitlogs are files that record write operations synchronously (they are optimized for writing). On the other hand, sstables are immutable data files on disk that are created asynchronously.
Without the commitlog, what happens if a node goes down before it flushes its memtables into sstables?
→ We might lose data, because some of it exists only in memtables.
However, with the commitlog, we can replay the logged mutations and persist the data to sstables.
You can refer to the details in https://stackoverflow.com/questions/34592948/what-is-the-purpose-of-cassandras-commit-log.
How write happens?
The write path is simpler than the read path.
- Write operation
- The write is appended to the commitlog
- The same write is applied to the in-memory memtable
- Eventually, memtables are flushed onto disk and immutable sstables are created
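The steps above can be sketched as a few lines of Python. This is a deliberately tiny simulation, assuming an in-process list as the "commitlog", a dict as the "memtable", and a list of dicts as the "SSTables"; the flush threshold is an invented toy value.

```python
# Toy simulation of the write path: commitlog -> memtable -> SSTable.
commitlog: list = []   # append-only durability log
memtable: dict = {}    # in-memory write buffer
sstables: list = []    # immutable on-disk files (simulated as dicts)

FLUSH_THRESHOLD = 2    # invented: flush after this many buffered entries

def write(key: str, value: str) -> None:
    commitlog.append((key, value))  # 1. durability first
    memtable[key] = value           # 2. then buffer in memory
    if len(memtable) >= FLUSH_THRESHOLD:
        flush()

def flush() -> None:
    # 3. the memtable becomes an immutable SSTable;
    #    the corresponding commitlog segments can then be freed.
    sstables.append(dict(memtable))
    memtable.clear()
    commitlog.clear()

write("k1", "v1")
write("k2", "v2")  # second write crosses the threshold and triggers a flush
```

Notice that a write never touches existing SSTables; that is why writes in Cassandra are fast and sequential.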
How does a read happen?
Read operations are more complicated because a read may need data from memtables and from several sstables.
- Row cache
→ a subset of rows kept in memory
- Bloom filter
→ each SSTable has a bloom filter that tells whether a partition key might be present (false positives are possible, false negatives are not)
- Partition key cache
→ key-value pairs in memory: the key is the partition key and the value is the location of the data for that partition key
- Partition index summary file
→ a sampling of the partition keys, kept in memory
- Partition index file
→ the location of the row in the data file (SSTable) for a given partition key
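The bloom filter is the easiest of these structures to demonstrate. Below is a minimal sketch (toy bit array size and hash count, MD5 instead of the hashes a real implementation would use): it can answer "definitely not in this SSTable" or "maybe in this SSTable", which lets the read path skip SSTables entirely.

```python
# Minimal bloom filter sketch: no false negatives, possible false positives.
import hashlib

class BloomFilter:
    def __init__(self, size: int = 64, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key: str):
        # Derive several bit positions from the key (toy hashing scheme).
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

# One filter per (simulated) SSTable: skip SSTables that cannot hold the key.
bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True (a stored key is never missed)
```

A `False` answer saves a disk seek into that SSTable; a `True` answer still requires checking the partition key cache and index files described above.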
Compaction
The concept of compaction covers several kinds of operations in Cassandra; what they have in common is that they take one or more sstables as input and produce new sstables.
Basically, with compaction, Cassandra combines one or more sstables into new sstables.
Why?
- To reduce read latency
- To delete data from the disk
- Merges data based on partition keys
- Keep the data with the latest timestamp
- Removes rows with tombstones
- Deletes old SSTables
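The merge rules listed above (latest timestamp wins, tombstoned rows dropped) can be sketched as a small function. This is a simplification that assumes SSTables are dicts mapping key to a `(timestamp, value)` pair, with `None` standing in for a tombstone; the real engine also honors gc_grace_seconds before purging tombstones.

```python
# Toy compaction: merge SSTables, keep the newest entry, drop tombstones.
TOMBSTONE = None

def compact(*sstables: dict) -> dict:
    merged: dict = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            # Keep only the entry with the latest timestamp per key.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Drop keys whose newest entry is a tombstone (simplified: the real
    # engine only purges tombstones after gc_grace_seconds has passed).
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}

old = {"a": (1, "v1"), "b": (1, "v1")}
new = {"a": (2, "v2"), "b": (2, TOMBSTONE)}
print(compact(old, new))  # {'a': (2, 'v2')}
```

After compaction, the old input SSTables can be deleted, which is how disk space for deleted data is actually reclaimed.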
You can see when compaction happens in Grafana: https://grafana.bebit-dev.com/d/uTAH6Fyik/cassandra-by-node?refresh=1m&panelId=39&fullscreen&orgId=1.
About tombstones
When a delete request is received by Cassandra, it does not actually remove the data from the underlying store. Instead, it writes a special piece of data known as a tombstone. The tombstone represents the delete and causes all values written before it to not appear in queries to the database. This approach is used instead of removing values because of the distributed nature of Cassandra.
- When a delete request is received, it doesn’t actually remove the data from the disk
- Instead, it writes a special piece of data known as a tombstone
- Data with tombstone will be deleted when compaction happens
Why?
- Each piece of data is replicated across several nodes
- Without a tombstone, what happens if one or more nodes fail to delete data?
- The surviving copy would be replicated back to the other nodes, undoing the delete
We can set gc_grace_seconds,
which is how long a tombstone is kept before compaction may purge it; this window gives the other replica nodes time to learn about the delete (e.g. via repair)!
TTL
Data in Cassandra can have an additional property called time to live - this is used to automatically drop data that has expired once the time is reached.
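TTL semantics can be sketched the same way as the other storage concepts above. This is a toy model assuming a dict store holding `(value, expiry_time)` pairs; in the real engine an expired cell then behaves like a tombstone and is purged by compaction.

```python
# Toy sketch of TTL: a value stops being visible once its expiry passes.
import time

def write_with_ttl(store: dict, key: str, value: str,
                   ttl: int, now: float) -> None:
    store[key] = (value, now + ttl)  # remember the absolute expiry time

def read(store: dict, key: str, now: float):
    if key not in store:
        return None
    value, expires_at = store[key]
    return value if now < expires_at else None  # expired data is hidden

store: dict = {}
t0 = time.time()
write_with_ttl(store, "session", "abc123", ttl=60, now=t0)
print(read(store, "session", now=t0 + 30))  # 'abc123'
print(read(store, "session", now=t0 + 61))  # None (expired)
```

Note that nothing is removed at expiry time itself; the data simply stops being returned, and the space is reclaimed later, mirroring the tombstone-and-compaction flow described above.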