This document might not be correct.
What is Cassandra?
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.
…
Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
Goal
- Understand concepts of Cassandra
- Understand how Cassandra reads, writes, and deletes data
- Understand our Cassandra configuration
Words
Node
→ Where you store your data. It is the basic infrastructure component of Cassandra.
- Coordinator node:
When a request is sent to any Cassandra node, that node acts as a proxy between the application (actually, the Cassandra driver) and the nodes involved in the request flow. This proxy node is called the coordinator. (ref. https://medium.com/jorgeacetozi/cassandra-architecture-and-write-path-anatomy-51e339bcfe0c)
Datacenter
→ A collection of related nodes. A datacenter can be a physical datacenter or virtual datacenter.
Cluster
→ A cluster contains one or more datacenters. It can span physical locations.
Cassandra Data model
- Columns - name-value pair
- Row - container for columns referenced by primary key
- Table - container for rows
- Keyspace - container for tables that span one or more nodes (It is like a database in relational database system)
“Keys” in Cassandra
- Primary key = partition key + clustering key
- Partition key decides how data is distributed across nodes
- Clustering key decides how data is sorted within a partition on a single node
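The split above can be sketched in plain Python. This is a toy illustration, not the real storage engine; the table, column names, and data are invented, modeling a hypothetical `PRIMARY KEY ((user_id), event_time)`.

```python
# Toy sketch: partition key groups rows, clustering key sorts within a group.
# All names and data here are invented for illustration.
from collections import defaultdict

# Rows for a hypothetical table with PRIMARY KEY ((user_id), event_time)
rows = [
    {"user_id": "alice", "event_time": 3, "action": "click"},
    {"user_id": "bob",   "event_time": 1, "action": "view"},
    {"user_id": "alice", "event_time": 1, "action": "login"},
]

# The partition key (user_id) decides which partition a row belongs to;
# within each partition, rows are kept sorted by the clustering key.
partitions = defaultdict(list)
for row in rows:
    partitions[row["user_id"]].append(row)
for part in partitions.values():
    part.sort(key=lambda r: r["event_time"])

print([r["event_time"] for r in partitions["alice"]])  # [1, 3]
```

Note how all of alice's rows end up in one partition, already ordered by `event_time`; this is why range queries on the clustering key are cheap.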
How is data distributed?
- A hashing function called partitioner is used to generate token values
- partition keys are used to generate the token
- Every node is assigned a unique token range that determines which rows go to which node.
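The three bullets above can be sketched as follows. This is a simplified illustration, assuming a toy MD5-based hash in place of Cassandra's Murmur3 partitioner, and a simple modulo instead of real token ranges; the node names are invented.

```python
# Minimal sketch of token-based distribution (toy hash, invented node names).
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def token(partition_key: str) -> int:
    """The 'partitioner': hash the partition key to a token."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16)

def owner(partition_key: str) -> str:
    """Map a token to a node (real Cassandra assigns token *ranges*)."""
    return NODES[token(partition_key) % len(NODES)]
```

Because the token depends only on the partition key, the same key is always routed to the same node, which is what makes reads and writes locatable without a central directory.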
Cassandra Architecture
Replication
Replication is necessary for high availability. We can decide how many nodes hold the same data (the replication factor) and how the data is placed (the replication strategy).
About replication strategy and factor
Note: With the gossip protocol and the snitch, each node knows which nodes hold replicas of the same data
Write consistency
Consistency in Cassandra is tunable, meaning we can choose how many replicas must acknowledge each operation.
- Increase or decrease consistency
- Trade-off between consistency & performance
- Configure consistency for read & write separately
We can set
- ONE
→ A write must be written to the commit log and memtable of at least one replica node.
- ALL
→ A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
- QUORUM
→ A write must be written to the commit log and memtable on a quorum of replica nodes across all datacenters.
- LOCAL_QUORUM
→ A write must be written to the commit log and memtable on a quorum of replica nodes in the same datacenter as the coordinator.
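The quorum arithmetic behind QUORUM/LOCAL_QUORUM is small enough to show directly. The formula quorum = floor(RF / 2) + 1 matches Cassandra's definition; the R + W > RF check is the usual rule of thumb for when reads are guaranteed to see the latest write.

```python
# Quorum arithmetic for tunable consistency.

def quorum(replication_factor: int) -> int:
    # A quorum is a majority of replicas.
    return replication_factor // 2 + 1

def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    # Reads and writes overlap in at least one replica when R + W > RF.
    return read_replicas + write_replicas > replication_factor

rf = 3
print(quorum(rf))                                          # 2
print(is_strongly_consistent(quorum(rf), quorum(rf), rf))  # True
print(is_strongly_consistent(1, 1, rf))                    # False
```

This is the trade-off the bullets above describe: QUORUM reads plus QUORUM writes give strong consistency, while ONE/ONE is faster but may read stale data.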
Read consistency
Basically, it is similar to the write consistency. You can refer to the details in https://docs.datastax.com/en/dse/5.1/dse-arch/datastax_enterprise/dbInternals/dbIntConfigConsistency.html#Readconsistencylevels.
Storage
CommitLogs
Commitlogs are an append-only log of all mutations local to a Cassandra node. Any data written to Cassandra will first be written to a commit log before being written to a memtable. This provides durability in the case of unexpected shutdown. On startup, any mutations in the commit log will be applied to memtables.
- Any data written will first be written to a commit log
Memtables
Memtables are in-memory structures where Cassandra buffers writes. In general, there is one active memtable per table. Eventually, memtables are flushed onto disk and become immutable SSTables.
This can be triggered in several ways:
- The memory usage of the memtables exceeds the configured threshold (see memtable_cleanup_threshold)
- The CommitLog approaches its maximum size, and forces memtable flushes in order to allow commitlog segments to be freed
- In-memory structures where Cassandra buffers writes
- On startup, memtables are rebuilt by replaying the commitlog
- Eventually, memtables are flushed onto disk and become immutable SSTables
SSTable
SSTables are the immutable data files that Cassandra uses for persisting data on disk.
As SSTables are flushed to disk from Memtables or are streamed from other nodes, Cassandra triggers compactions which combine multiple SSTables into one. Once the new SSTable has been written, the old SSTables can be removed.
- SSTables are immutable data files and they are used for persisting data on disk
What is the difference between commitlog and sstable?
The commitlogs are files that record write operations synchronously (they are optimized for writing). On the other hand, sstables are immutable data files on disk that are created asynchronously.
Without the commitlog, what happens if a node goes down before it flushes its memtables into sstables?
→ We might lose data, because some of it exists only in memtables.
However, with the commitlog, we can replay the logged mutations and persist the data to sstables.
You can refer to the details in https://stackoverflow.com/questions/34592948/what-is-the-purpose-of-cassandras-commit-log.
How write happens?
The write path is simpler than the read path.
- Write operation
- The write is appended to the commitlog
- The same write is applied to the in-memory memtable
- Eventually, memtables are flushed onto disk and immutable sstables are created
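The steps above can be sketched as a few lines of Python. This is a deliberately tiny simulation, assuming an in-process list as the "commitlog", a dict as the "memtable", and a list of dicts as the "SSTables"; the flush threshold is an invented toy value.

```python
# Toy simulation of the write path: commitlog -> memtable -> SSTable.
commitlog: list = []   # append-only durability log
memtable: dict = {}    # in-memory write buffer
sstables: list = []    # immutable on-disk files (simulated as dicts)

FLUSH_THRESHOLD = 2    # invented: flush after this many buffered entries

def write(key: str, value: str) -> None:
    commitlog.append((key, value))  # 1. durability first
    memtable[key] = value           # 2. then buffer in memory
    if len(memtable) >= FLUSH_THRESHOLD:
        flush()

def flush() -> None:
    # 3. the memtable becomes an immutable SSTable;
    #    the corresponding commitlog segments can then be freed.
    sstables.append(dict(memtable))
    memtable.clear()
    commitlog.clear()

write("k1", "v1")
write("k2", "v2")  # second write crosses the threshold and triggers a flush
```

Notice that a write never touches existing SSTables; that is why writes in Cassandra are fast and sequential.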
How does a read happen?
Read operations are more complicated because a read may need data from memtables and from several sstables.
- Row cache
→ a subset of rows kept in memory
- Bloom filter
→ each SSTable has a bloom filter that tells whether a partition key might be present (false positives are possible, false negatives are not)
- Partition key cache
→ key-value pairs in memory: the key is the partition key and the value is the location of the data for that partition key
- Partition index summary file
→ a sampling of the partition keys, kept in memory
- Partition index file
→ the location of the row in the data file (SSTable) for a given partition key
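The bloom filter is the easiest of these structures to demonstrate. Below is a minimal sketch (toy bit array size and hash count, MD5 instead of the hashes a real implementation would use): it can answer "definitely not in this SSTable" or "maybe in this SSTable", which lets the read path skip SSTables entirely.

```python
# Minimal bloom filter sketch: no false negatives, possible false positives.
import hashlib

class BloomFilter:
    def __init__(self, size: int = 64, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key: str):
        # Derive several bit positions from the key (toy hashing scheme).
        for i in range(self.hashes):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # True means "maybe present"; False means "definitely absent".
        return all(self.bits >> pos & 1 for pos in self._positions(key))

# One filter per (simulated) SSTable: skip SSTables that cannot hold the key.
bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True (a stored key is never missed)
```

A `False` answer saves a disk seek into that SSTable; a `True` answer still requires checking the partition key cache and index files described above.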
Compaction
The concept of compaction covers several kinds of operations in Cassandra; what they have in common is that they take one or more sstables as input and produce new sstables.
Basically, with compaction, Cassandra combines one or more sstables into new sstables.
Why?
- To reduce read latency
- To delete data from the disk
- Merges data based on partition keys
- Keep the data with the latest timestamp
- Removes rows with tombstones
- Deletes old SSTables
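The merge rules listed above (latest timestamp wins, tombstoned rows dropped) can be sketched as a small function. This is a simplification that assumes SSTables are dicts mapping key to a `(timestamp, value)` pair, with `None` standing in for a tombstone; the real engine also honors gc_grace_seconds before purging tombstones.

```python
# Toy compaction: merge SSTables, keep the newest entry, drop tombstones.
TOMBSTONE = None

def compact(*sstables: dict) -> dict:
    merged: dict = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            # Keep only the entry with the latest timestamp per key.
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Drop keys whose newest entry is a tombstone (simplified: the real
    # engine only purges tombstones after gc_grace_seconds has passed).
    return {k: v for k, v in merged.items() if v[1] is not TOMBSTONE}

old = {"a": (1, "v1"), "b": (1, "v1")}
new = {"a": (2, "v2"), "b": (2, TOMBSTONE)}
print(compact(old, new))  # {'a': (2, 'v2')}
```

After compaction, the old input SSTables can be deleted, which is how disk space for deleted data is actually reclaimed.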
You can see when compaction happens in Grafana: https://grafana.bebit-dev.com/d/uTAH6Fyik/cassandra-by-node?refresh=1m&panelId=39&fullscreen&orgId=1.
About tombstones
When a delete request is received by Cassandra, it does not actually remove the data from the underlying store. Instead, it writes a special piece of data known as a tombstone. The tombstone represents the delete and causes all values written before it to not appear in queries to the database. This approach is used instead of removing values because of the distributed nature of Cassandra.
- When a delete request is received, it doesn’t actually remove the data from the disk
- Instead, it writes a special piece of data known as a tombstone
- Data with tombstone will be deleted when compaction happens
Why?
- Each piece of data is replicated across several nodes
- Without a tombstone, what happens if one or more nodes fail to delete data?
- The surviving copy would be replicated back to the other nodes, undoing the delete
We can set gc_grace_seconds,
which is how long a tombstone is kept before compaction may purge it; this window gives the other replica nodes time to learn about the delete (e.g. via repair)!
TTL
Data in Cassandra can have an additional property called time to live - this is used to automatically drop data that has expired once the time is reached.
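TTL semantics can be sketched the same way as the other storage concepts above. This is a toy model assuming a dict store holding `(value, expiry_time)` pairs; in the real engine an expired cell then behaves like a tombstone and is purged by compaction.

```python
# Toy sketch of TTL: a value stops being visible once its expiry passes.
import time

def write_with_ttl(store: dict, key: str, value: str,
                   ttl: int, now: float) -> None:
    store[key] = (value, now + ttl)  # remember the absolute expiry time

def read(store: dict, key: str, now: float):
    if key not in store:
        return None
    value, expires_at = store[key]
    return value if now < expires_at else None  # expired data is hidden

store: dict = {}
t0 = time.time()
write_with_ttl(store, "session", "abc123", ttl=60, now=t0)
print(read(store, "session", now=t0 + 30))  # 'abc123'
print(read(store, "session", now=t0 + 61))  # None (expired)
```

Note that nothing is removed at expiry time itself; the data simply stops being returned, and the space is reclaimed later, mirroring the tombstone-and-compaction flow described above.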