WTF is an LSM Tree? The Secret Sauce Behind Modern Databases
If you've been paying attention to database technology over the past few years, you've probably noticed a pattern: every hot new database seems to be built on something called an "LSM tree." RocksDB? LSM tree. Apache Cassandra? LSM tree. ScyllaDB? LSM tree. LevelDB, HBase, InfluxDB, even parts of newer systems like CockroachDB—all LSM trees under the hood.
But here's the thing: most developers using these databases have no idea what an LSM tree actually is, how it works, or why it's suddenly everywhere. They just know these databases are "fast for writes" and "good for time-series data." It's like driving a Tesla without understanding electric motors—it works, but you're missing out on understanding why it works so differently from traditional engines.
✨
LSM trees aren't just another data structure—they represent a fundamental rethinking of how databases should handle the eternal conflict between write performance and read performance. Understanding them will change how you think about database design forever.
Today, we're going to demystify LSM trees, understand why they've taken over the database world, and most importantly, learn when you should (and shouldn't) use databases built on them.
The Fundamental Database Dilemma
Before we dive into LSM trees, we need to understand the problem they're solving. At its core, every database faces a brutal trade-off:
Fast Writes: The fastest way to write data is to just append it to the end of a file. No seeking, no reading, just pure sequential writes. Your disk (especially SSDs) loves this.
Fast Reads: The fastest way to read data is from a sorted, indexed structure where you can binary search or directly jump to what you need. B-trees are the classic example.
Space Efficiency: You don't want to store the same data multiple times or waste space on deleted records.
Traditional databases (think PostgreSQL, MySQL, most RDBMS) chose to optimize for reads using B-trees. This makes sense—for decades, most database workloads were read-heavy. But this comes at a cost: every write potentially needs to update multiple pages in the tree, causing random I/O.
⚠️
B-trees require random writes to maintain their sorted structure. On modern SSDs, random writes are about 100x slower than sequential writes. On spinning disks, it's even worse—up to 1000x slower.
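If you want to feel that difference yourself, here's a rough micro-benchmark sketch (filenames, block size, and counts are arbitrary choices of mine; it runs through the OS page cache, so the gap it shows will be far smaller than raw device numbers, but the direction is the same):
python
import os, random, time

BLOCK = 4096           # one 4 KB page per write
N = 20_000             # ~80 MB of data total
buf = os.urandom(BLOCK)

# Sequential: append block after block to the end of one file
start = time.perf_counter()
with open("seq.dat", "wb") as f:
    for _ in range(N):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())
seq = time.perf_counter() - start

# Random: overwrite blocks at random offsets in a pre-sized file
with open("rand.dat", "wb") as f:
    f.truncate(N * BLOCK)
start = time.perf_counter()
with open("rand.dat", "r+b") as f:
    for _ in range(N):
        f.seek(random.randrange(N) * BLOCK)
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())
rnd = time.perf_counter() - start

print(f"sequential: {seq:.2f}s  random: {rnd:.2f}s")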
Enter the LSM Tree: Write-Optimized by Design
LSM stands for "Log-Structured Merge" tree, and the name tells you everything about the philosophy: treat your data like a log (sequential, append-only) and periodically merge it to keep things organized.
Here's the brilliant insight: what if we completely separated the write path from the read path?
The genius of LSM trees is that they turn the database write problem into a series of sequential writes:
•All writes go to memory first (the MemTable)
•When memory fills up, flush to disk as an immutable sorted file (SSTable)
•Periodically merge and compact these files in the background
Let's break down each component:
🏗️ MemTable: An in-memory write buffer that stores recent writes in a sorted data structure (usually a skip list). All writes go here first, before anything touches disk.
🏗️ SSTable: Sorted String Table, an immutable, sorted file format that stores key-value pairs on disk. Once written, SSTables are never modified; they are only merged and eventually deleted by compaction.
⚡ Bloom Filter: A probabilistic data structure that can quickly tell if a key might be in an SSTable. It can have false positives but never false negatives, so a "no" means the file can be skipped entirely.
The MemTable: Your Write Buffer
cpp
// Simplified MemTable structure (conceptual)
#include <atomic>
#include <map>
#include <optional>
#include <string>

using Key = std::string;
using Value = std::string;

class MemTable {
private:
    std::map<Key, Value> data;   // Often a skip list in practice
    std::atomic<size_t> size{0};
    static constexpr size_t threshold = 4 * 1024 * 1024;  // e.g. a 4 MB write buffer

    void flushToDisk();          // Writes the sorted contents out as an SSTable (declaration only)

public:
    void put(const Key& key, const Value& value) {
        data[key] = value;
        size += key.size() + value.size();

        if (size > threshold) {
            flushToDisk();       // Trigger SSTable creation
        }
    }

    std::optional<Value> get(const Key& key) {
        auto it = data.find(key);
        return it != data.end() ? std::optional<Value>(it->second) : std::nullopt;
    }
};
The MemTable is just an in-memory data structure (usually a skip list or red-black tree) that keeps recent writes sorted by key. It's blazing fast because:
•All operations are in-memory
•No disk I/O on the write path
•Sorted structure enables fast lookups
SSTables: Immutable Sorted Files
When the MemTable fills up, it's flushed to disk as an SSTable (Sorted String Table). These are immutable files with a simple structure:
python
# Conceptual SSTable structure
import time

class SSTable:
    def __init__(self, data):
        self.data_blocks = self._create_data_blocks(data)
        self.index = self._build_index()
        self.bloom_filter = self._build_bloom_filter()
        self.metadata = {
            'min_key': min(data.keys()),
            'max_key': max(data.keys()),
            'num_entries': len(data),
            'timestamp': time.time()
        }

    def get(self, key):
        # Quick rejection using bloom filter
        if not self.bloom_filter.might_contain(key):
            return None

        # Quick rejection using metadata
        if key < self.metadata['min_key'] or key > self.metadata['max_key']:
            return None

        # Binary search in index to find block
        block_offset = self._binary_search_index(key)

        # Read only the relevant block
        block = self._read_block(block_offset)
        return block.get(key)
The beauty of SSTables:
•Immutable: Once written, never modified (simplifies concurrency)
•Sorted: Enables efficient range queries and merging
•Self-contained: Each file has its own index and metadata
•Bloom filters: Probabilistic data structure to quickly check if a key might exist
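Since bloom filters are what let a read skip most SSTables without touching disk, here's a minimal sketch of one. The bit-array size and hash count are arbitrary; real implementations derive them from the expected key count and a target false-positive rate.
python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from salted hashes of the key
        for i in range(self.num_hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: bytes) -> bool:
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

bf = BloomFilter()
bf.add(b"apple")
print(bf.might_contain(b"apple"))   # True
print(bf.might_contain(b"durian"))  # Almost certainly False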
The Compaction Process: Keeping Things Tidy
Here's where things get interesting. As you keep writing, you'll accumulate many SSTables. This creates two problems:
•Read amplification: Reads might need to check many files
•Space amplification: The same key might exist in multiple files (with different versions)
The solution is compaction—periodically merging SSTables together:
💡 Read Amplification: The overhead of reading data from multiple locations to serve a single query. In LSM trees, a key might exist in multiple SSTables, so a single lookup may have to check several files.
💡 Write Amplification: The ratio of actual data written to disk versus the logical data written by the application. In LSM trees, data is rewritten every time compaction moves it into a new SSTable.
💡 Space Amplification: The ratio of actual disk space used to the logical data size. In LSM trees, the same key may exist in multiple SSTables until compaction cleans up the obsolete versions.
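To make read amplification concrete, here's a sketch of the lookup path (the class and method names are mine, not from any particular engine): a get() consults the MemTable first, then every candidate SSTable from newest to oldest until something answers.
python
class LSMTree:
    def __init__(self):
        self.memtable = {}     # stands in for the sorted MemTable
        self.sstables = []     # newest first; each supports .get(key) and .might_contain(key)

    def get(self, key):
        # 1. The most recent version of a key lives in memory
        if key in self.memtable:
            return self.memtable[key]

        # 2. Then walk SSTables from newest to oldest
        for sstable in self.sstables:
            if not sstable.might_contain(key):   # bloom filter saves a disk read
                continue
            value = sstable.get(key)
            if value is not None:
                return value                     # newest version wins

        # 3. Every file may have been checked and the key still not found:
        #    that's read amplification in action
        return None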
Compaction Strategies: The Secret Sauce
Different databases use different compaction strategies, and this is where LSM trees get really interesting:
[Interactive demo: a MemTable flush and two levels of overlapping SSTables being merged into a single, sorted Level 1 file, with duplicate versions of each key collapsing along the way.]
Key concepts the demo illustrates:
•Newer data (higher timestamps) overwrites older versions of the same key
•Compaction merges overlapping SSTables and removes obsolete versions
•Level 0 contains unsorted, overlapping files from MemTable flushes
•Lower levels have sorted, non-overlapping key ranges for efficient reads
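At its core, compaction is just a k-way merge of sorted runs where the newest version of each key wins. A minimal sketch, with each input SSTable represented as a list of (key, timestamp, value) tuples and None standing in for a tombstone; dropping tombstones outright is only safe in a full compaction down to the bottom level:
python
import heapq

def compact(sstables):
    """Merge sorted runs; keep only the newest version of each key."""
    # Order by key, and within a key by descending timestamp
    merged = heapq.merge(*sstables, key=lambda kv: (kv[0], -kv[1]))
    result = []
    last_key = object()                 # sentinel that equals no real key
    for key, ts, value in merged:
        if key == last_key:
            continue                    # older version of a key we already emitted
        last_key = key
        if value is not None:           # drop tombstones (full compaction only)
            result.append((key, ts, value))
    return result

l0 = [("apple", 200, "green"), ("date", 201, "brown")]
l1 = [("apple", 50, "red"), ("banana", 51, "green"), ("date", 53, None)]
print(compact([l0, l1]))
# [('apple', 200, 'green'), ('banana', 51, 'green'), ('date', 201, 'brown')]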
Leveled Compaction (RocksDB, LevelDB)
python
# Simplified leveled compaction logic
MB, GB, TB = 1024**2, 1024**3, 1024**4

class LeveledCompaction:
    def __init__(self):
        self.levels = [[] for _ in range(7)]   # L0 to L6, each a list of SSTables
        self.level_max_bytes = [
            10 * MB,    # L0: 10MB
            100 * MB,   # L1: 100MB
            1 * GB,     # L2: 1GB
            10 * GB,    # L3: 10GB
            100 * GB,   # L4: 100GB
            1 * TB,     # L5: 1TB
            10 * TB,    # L6: 10TB
        ]

    def get_level_size(self, level):
        return sum(sst.size_bytes for sst in self.levels[level])

    def compact(self, level):
        if self.get_level_size(level) > self.level_max_bytes[level]:
            # Pick overlapping SSTables from level and level+1,
            # merge them (newest version of each key wins),
            # and install the result in level+1
            victims = self._pick_overlapping(level, level + 1)
            merged = self._merge(victims)
            self._install(level + 1, merged, replacing=victims)
Pros: Perfect for time-series data where you query recent data more often
Cons: Not suitable for random access patterns
Real-World LSM Trees in Action
Let's look at how different databases use LSM trees and what makes each unique:
RocksDB: The Swiss Army Knife
Facebook's RocksDB is probably the most widely embedded LSM tree implementation. It's used in everything from MySQL (MyRocks storage engine) to Kafka Streams to blockchain databases.
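"Embedded" here means RocksDB runs as a library inside your process rather than as a separate server. A quick sketch of what that looks like, assuming the third-party python-rocksdb bindings (the native API is C++, and the exact Python package you'd reach for may differ):
python
import rocksdb  # third-party binding; package name varies by platform and fork

# Everything below flows through the MemTable/SSTable machinery described above
opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("example.db", opts)

db.put(b"user:42", b'{"name": "Ada"}')   # lands in the in-memory write buffer, not an SSTable
print(db.get(b"user:42"))                # checks the MemTable first, then SSTables via bloom filters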
✨
The 10x Rule: LSM trees often provide 10x better write performance than B-trees, while B-trees typically provide 1.5x better read performance. For write-heavy workloads, this trade-off is a no-brainer.
When to Use LSM Tree Databases
LSM trees excel in specific scenarios:
✅ Perfect Use Cases
1. Time-Series Data
python
# IoT sensor data, metrics, logs
sensor_data = {
    "timestamp": 1699564800,
    "sensor_id": "temp_sensor_42",
    "temperature": 23.5,
    "humidity": 45.2
}
# Append-only pattern = LSM heaven
tsdb.write(sensor_data)
2. Write-Heavy Workloads
•Event streaming (Kafka Streams uses RocksDB for its state stores)
The Tuning Trap: Don't over-tune! Start with defaults, measure your actual workload, then tune based on real bottlenecks. Most applications work fine with default settings.
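When you do reach the tuning stage, a handful of knobs account for most of the wins. The names below follow RocksDB's option names; the values are purely illustrative, not recommendations:
python
# Illustrative only: RocksDB-style option names with made-up values
lsm_tuning_knobs = {
    "write_buffer_size": 64 * 1024 * 1024,          # MemTable size before a flush
    "max_write_buffer_number": 3,                   # MemTables that can pile up before writes stall
    "level0_file_num_compaction_trigger": 4,        # L0 files before compaction into L1 kicks in
    "max_bytes_for_level_base": 256 * 1024 * 1024,  # target size of L1; each level below is ~10x larger
    "target_file_size_base": 64 * 1024 * 1024,      # size of individual SSTables in L1
    "max_background_jobs": 4,                       # threads shared by flushes and compactions
}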
The Future of LSM Trees
LSM trees continue to evolve:
Hybrid Approaches
Some databases are combining LSM trees with other structures:
•Tiered designs: TiDB uses an LSM tree for recent data and a B-tree for historical data
•Computational Storage: Offload compaction to smart SSDs
•GPU Acceleration: Parallel merge operations during compaction
Learned Indexes
Research into using machine learning to predict which SSTable contains a key:
python
# Instead of checking bloom filters, use ML model
class LearnedLSM:
    def __init__(self):
        self.model = train_key_distribution_model()

    def read(self, key):
        # Model predicts which SSTable likely contains the key
        predicted_sstable = self.model.predict(key)

        # Check predicted SSTable first
        if result := predicted_sstable.get(key):
            return result

        # Fall back to regular search
        return self.regular_search(key)
Conclusion: LSM Trees Are Eating the Database World
LSM trees represent a fundamental shift in database design philosophy: instead of optimizing for the general case, optimize for the common case (writes) and make the uncommon case (reads) good enough.
💡
Key Takeaways:
•LSM trees trade read performance for write performance - a winning trade for many modern workloads
•They turn random writes into sequential writes - crucial for SSD longevity and performance
•Compaction strategies determine the exact trade-offs - choose based on your workload
•They're not a silver bullet - B-trees still win for read-heavy OLTP workloads
The explosion of LSM tree adoption isn't an accident. In a world of event streams, time-series data, log aggregation, and high-velocity data ingestion, LSM trees provide exactly what modern applications need: the ability to sustainably handle massive write throughput without breaking the bank on hardware.
So next time you're evaluating databases and see "LSM tree-based storage engine," you'll know exactly what that means: blazing fast writes, pretty good reads, and a design philosophy that embraces the realities of modern hardware and workloads.
Time to go compact some SSTables. These blog posts don't write themselves to disk, you know.