2023-11-15 · 18 min read
Databases

WTF is an LSM Tree? The Secret Sauce Behind Modern Databases


If you've been paying attention to database technology over the past few years, you've probably noticed a pattern: every hot new database seems to be built on something called an "LSM tree." RocksDB? LSM tree. Apache Cassandra? LSM tree. ScyllaDB? LSM tree. LevelDB, HBase, InfluxDB, even parts of newer systems like CockroachDB—all LSM trees under the hood.
But here's the thing: most developers using these databases have no idea what an LSM tree actually is, how it works, or why it's suddenly everywhere. They just know these databases are "fast for writes" and "good for time-series data." It's like driving a Tesla without understanding electric motors—it works, but you're missing out on understanding why it works so differently from traditional engines.
LSM trees aren't just another data structure—they represent a fundamental rethinking of how databases should handle the eternal conflict between write performance and read performance. Understanding them will change how you think about database design forever.
Today, we're going to demystify LSM trees, understand why they've taken over the database world, and most importantly, learn when you should (and shouldn't) use databases built on them.

The Fundamental Database Dilemma

Before we dive into LSM trees, we need to understand the problem they're solving. At its core, every database faces a brutal trade-off:

The Database Triangle of Sadness

[Diagram: a triangle with Fast Writes ("append-only is fastest"), Fast Reads ("sorted data reads fastest"), and Space Efficiency ("no duplicates saves space") at its corners, with a trade-off along every edge.]

Fast Writes: The fastest way to write data is to just append it to the end of a file. No seeking, no reading, just pure sequential writes. Your disk (especially SSDs) loves this.
Fast Reads: The fastest way to read data is from a sorted, indexed structure where you can binary search or directly jump to what you need. B-trees are the classic example.
Space Efficiency: You don't want to store the same data multiple times or waste space on deleted records.
Traditional databases (think PostgreSQL, MySQL, most RDBMS) chose to optimize for reads using B-trees. This makes sense—for decades, most database workloads were read-heavy. But this comes at a cost: every write potentially needs to update multiple pages in the tree, causing random I/O.
⚠️
B-trees require random writes to maintain their sorted structure. Random I/O is dramatically slower than sequential I/O: on spinning disks the gap can be 100x or more because of seek time, and even on SSDs scattered small writes hurt throughput and trigger internal write amplification that wears the drive.
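
You can get a rough feel for this gap yourself. The sketch below is my own illustration (not from any database codebase): it appends records to one file sequentially, then writes the same records at random offsets in another. The exact ratio you measure depends heavily on your OS page cache and drive, but random writes will generally lag.

python
import os
import random
import time

N = 20_000
RECORD = b"x" * 4096  # 4 KiB per record

# Sequential: append every record to the end of one file.
start = time.time()
with open("seq.dat", "wb") as f:
    for _ in range(N):
        f.write(RECORD)
    f.flush()
    os.fsync(f.fileno())
seq = time.time() - start

# Random: write the same records at random offsets in a pre-sized file.
# Note: the page cache absorbs much of the difference in this rough test;
# per-write fsync or O_DIRECT exposes the true gap on real hardware.
start = time.time()
with open("rand.dat", "wb") as f:
    f.truncate(N * len(RECORD))
    for _ in range(N):
        f.seek(random.randrange(N) * len(RECORD))
        f.write(RECORD)
    f.flush()
    os.fsync(f.fileno())
rand = time.time() - start

print(f"sequential: {seq:.2f}s   random: {rand:.2f}s   ratio: {rand / seq:.1f}x")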

Enter the LSM Tree: Write-Optimized by Design

LSM stands for "Log-Structured Merge" tree, and the name tells you everything about the philosophy: treat your data like a log (sequential, append-only) and periodically merge it to keep things organized.
Here's the brilliant insight: what if we completely separated the write path from the read path?

[Diagram: new writes land in an in-memory MemTable; when full, it is flushed as an immutable SSTable at level L0; background compaction merges SSTables down through levels L1, L2, and L3; a read request consults the MemTable and then the SSTables along the read path.]

The genius of LSM trees is that they turn the database write problem into a series of sequential writes:
  1. All writes go to memory first (the MemTable)
  2. When memory fills up, flush to disk as an immutable sorted file (SSTable)
  3. Periodically merge and compact these files in the background
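To make that flow concrete, here's a toy LSM store in a few dozen lines of Python. This is a sketch of the idea, not how any real engine is written (the name ToyLSM and the in-memory "SSTables" are my own simplifications): writes go to an in-memory table, full memtables are frozen into sorted immutable runs, and reads check newest-to-oldest.

python
class ToyLSM:
    """Minimal LSM sketch: memtable + list of immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}           # recent writes, in memory
        self.sstables = []           # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # no on-disk structure to update
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into an immutable sorted run ("SSTable").
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)  # newest runs are consulted first
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # 1. check the memtable
            return self.memtable[key]
        for run in self.sstables:     # 2. check runs, newest to oldest
            lo, hi = 0, len(run) - 1
            while lo <= hi:           # binary search within the sorted run
                mid = (lo + hi) // 2
                if run[mid][0] == key:
                    return run[mid][1]
                if run[mid][0] < key:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return None


db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
print(db.get("k3"))   # found in an older run
print(db.get("k9"))   # still in the memtable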
Let's break down each component:
  • MemTable: an in-memory write buffer that keeps recent writes in a sorted data structure (usually a skip list). All writes land here first.
  • SSTable (Sorted String Table): an immutable, sorted file format that stores key-value pairs on disk. Once written, an SSTable is never modified.
  • Bloom filter: a probabilistic data structure that can quickly tell whether a key might be in an SSTable. It can return false positives but never false negatives.

The MemTable: Your Write Buffer

cpp
// Simplified MemTable structure (conceptual)
class MemTable {
private:
    std::map<Key, Value> data;     // Often a skip list in practice
    std::atomic<size_t> size{0};

public:
    void put(const Key& key, const Value& value) {
        data[key] = value;
        size += key.size() + value.size();

        if (size > threshold) {
            flushToDisk();  // Trigger SSTable creation
        }
    }

    std::optional<Value> get(const Key& key) {
        auto it = data.find(key);
        return it != data.end() ? std::optional<Value>(it->second)
                                : std::nullopt;
    }
};
The MemTable is just an in-memory data structure (usually a skip list or red-black tree) that keeps recent writes sorted by key. It's blazing fast because:
  • All operations are in-memory
  • No disk I/O on the write path
  • Sorted structure enables fast lookups

SSTables: Immutable Sorted Files

When the MemTable fills up, it's flushed to disk as an SSTable (Sorted String Table). These are immutable files with a simple structure:

SSTable Structure

  • Data blocks: the sorted key-value pairs themselves (key1:value1, key2:value2, ...)
  • Index block: a sparse index mapping keys to block offsets (key1:offset1, key100:offset2, ...)
  • Metadata: min_key, max_key, timestamp, size
  • Bloom filter: probabilistic membership test for fast negative lookups

python
# Conceptual SSTable structure
class SSTable:
    def __init__(self, data):
        self.data_blocks = self._create_data_blocks(data)
        self.index = self._build_index()
        self.bloom_filter = self._build_bloom_filter()
        self.metadata = {
            'min_key': min(data.keys()),
            'max_key': max(data.keys()),
            'num_entries': len(data),
            'timestamp': time.time()
        }

    def get(self, key):
        # Quick rejection using bloom filter
        if not self.bloom_filter.might_contain(key):
            return None

        # Quick rejection using metadata
        if key < self.metadata['min_key'] or key > self.metadata['max_key']:
            return None

        # Binary search in index to find block
        block_offset = self._binary_search_index(key)

        # Read only the relevant block
        block = self._read_block(block_offset)
        return block.get(key)
The beauty of SSTables:
  • Immutable: Once written, never modified (simplifies concurrency)
  • Sorted: Enables efficient range queries and merging
  • Self-contained: Each file has its own index and metadata
  • Bloom filters: Probabilistic data structure to quickly check if a key might exist
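
Bloom filters deserve a closer look, because they're what keep reads cheap despite the many files. Here's a minimal sketch of one; this is my own toy version using Python's built-in hash with salts, not the double-hashing scheme real engines use.

python
class ToyBloomFilter:
    """Bit array + k salted hashes. False positives possible, false negatives never."""

    def __init__(self, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # use a big int as the bit array

    def _positions(self, key):
        # Derive k bit positions by salting the key; real implementations
        # typically combine two independent hashes instead.
        for salt in range(self.num_hashes):
            yield hash((salt, key)) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))


bf = ToyBloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))    # True (always: no false negatives)
print(bf.might_contain("banana"))   # almost certainly False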

The Compaction Process: Keeping Things Tidy

Here's where things get interesting. As you keep writing, you'll accumulate many SSTables. This creates two problems:
  1. Read amplification: Reads might need to check many files
  2. Space amplification: The same key might exist in multiple files (with different versions)
The solution is compaction—periodically merging SSTables together:
  • Read amplification: the overhead of reading data from multiple locations to serve a single query. In LSM trees, a key might exist in multiple SSTables, so one read may have to check several files.
  • Write amplification: the ratio of bytes actually written to disk versus the logical bytes written by the application. In LSM trees, data is rewritten each time compaction moves it to a lower level.
  • Space amplification: the ratio of actual disk space used to the logical data size. In LSM trees, the same key may exist in multiple SSTables until compaction removes the obsolete versions.
[Sequence diagram: the app writes k1:v1, k2:v2, then an update k1:v3; the MemTable is flushed to Level 0, and successive compactions push data into larger, fewer files at Levels 1 and 2, keeping k1:v3 (latest) and k2:v2 while the obsolete k1:v1 is dropped.]

Compaction Strategies: The Secret Sauce

Different databases use different compaction strategies, and this is where LSM trees get really interesting:
[Interactive demo: a MemTable (fig, grape, apple) plus two overlapping Level 0 SSTables (key ranges [apple - cherry] and [apple - elderberry]) and an older, sorted Level 1 file (range [apple - date]) are compacted together; of 13 stored key versions only 7 unique keys survive, with the newest timestamp winning for each key.]
Key Concepts:
  • Newer data (higher timestamps) overwrites older versions of the same key
  • Compaction merges overlapping SSTables and removes obsolete versions
  • Level 0 contains unsorted, overlapping files from MemTable flushes
  • Lower levels have sorted, non-overlapping key ranges for efficient reads
Performance Insights:

Write Amplification: Each key is rewritten every time compaction moves it to a lower level, so one logical write can turn into several physical writes over its lifetime.

Space Amplification: Before compaction, the same data existed in multiple versions; merging down to one version per key reclaims that space (13 stored versions became 7 unique keys in the example above).

Read Amplification: Without compaction, a read would need to check all 3 overlapping files. After compaction, a Level 1 query needs to check just 1 file.
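
Whatever the strategy, the core primitive is the same: a k-way merge of sorted runs that keeps only the newest version of each key. Here's a minimal sketch, assuming each run is a list of (key, timestamp, value) tuples sorted by key (the function and tuple layout are my own, not any engine's format).

python
import heapq

def merge_runs(runs):
    """Merge sorted runs, keeping only the newest version of each key."""
    merged = []
    # Order equal keys by descending timestamp so the newest version comes first.
    stream = heapq.merge(*runs, key=lambda e: (e[0], -e[1]))
    last_key = object()  # sentinel that never equals a real key
    for key, ts, value in stream:
        if key != last_key:
            merged.append((key, ts, value))
            last_key = key
        # Older versions of the same key are skipped (a real engine would
        # also drop tombstones for deleted keys here).
    return merged


l0_a = [("apple", 200, "green"), ("date", 201, "brown")]
l0_b = [("apple", 100, "red"), ("banana", 101, "yellow")]
print(merge_runs([l0_a, l0_b]))
# [('apple', 200, 'green'), ('banana', 101, 'yellow'), ('date', 201, 'brown')]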

Leveled Compaction (RocksDB, LevelDB)

python
# Simplified leveled compaction logic
class LeveledCompaction:
    def __init__(self):
        self.levels = [[] for _ in range(7)]  # L0 to L6
        self.level_max_bytes = [
            10 * MB,    # L0: 10MB
            100 * MB,   # L1: 100MB
            1 * GB,     # L2: 1GB
            10 * GB,    # L3: 10GB
            100 * GB,   # L4: 100GB
            1 * TB,     # L5: 1TB
            10 * TB     # L6: 10TB
        ]

    def compact(self, level):
        if self.get_level_size(level) > self.level_max_bytes[level]:
            # Pick overlapping SSTables from level and level+1
            files_to_compact = self.pick_compaction_files(level)

            # Merge sort all entries, removing duplicates
            merged_data = self.merge_sort_files(files_to_compact)

            # Write new SSTables to level+1
            new_sstables = self.write_sstables(merged_data, level + 1)

            # Delete old files
            self.delete_files(files_to_compact)
Pros: Minimizes read amplification (each level has non-overlapping key ranges)
Cons: Higher write amplification (data gets rewritten multiple times)
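
How much higher? Here's a rough back-of-envelope estimate of my own, assuming the 10x size ratio between levels shown above: every byte that migrates to the next level drags along roughly a level's worth of overlapping data, so write amplification grows with both the fanout and the number of level transitions.

python
# Back-of-envelope write amplification for leveled compaction (rough upper bound).
# Assumptions: size ratio (fanout) of 10 between adjacent levels, and data that
# eventually migrates across 6 level transitions (L0 -> L6). Moving a byte into
# level N+1 rewrites roughly `fanout` bytes of level N+1 per byte coming from N.
fanout = 10
level_transitions = 6

write_amplification = 1 + fanout * level_transitions  # +1 for the initial flush
print(write_amplification)  # ~61: each logical byte may be rewritten tens of times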

Size-Tiered Compaction (Cassandra, ScyllaDB)

python
# Simplified size-tiered compaction
class SizeTieredCompaction:
    def __init__(self):
        self.min_threshold = 4   # Minimum files of similar size
        self.max_threshold = 32  # Maximum files to compact at once

    def find_compaction_candidates(self, sstables):
        # Group SSTables by size (within 50% of each other)
        buckets = self.bucket_by_size(sstables)

        for bucket in buckets:
            if self.min_threshold <= len(bucket) <= self.max_threshold:
                return bucket[:self.max_threshold]

        return None

    def compact(self):
        candidates = self.find_compaction_candidates(self.all_sstables)
        if candidates:
            merged = self.merge_sstables(candidates)
            self.write_new_sstable(merged)
            self.delete_old_sstables(candidates)
Pros: Lower write amplification
Cons: Higher read amplification, temporary space usage
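
The bucket_by_size helper above does the real work. Here's my own sketch of one possible implementation, grouping files whose sizes fall within 50% of a bucket's running average (the SST namedtuple below is purely for demonstration):

python
from collections import namedtuple

def bucket_by_size(sstables, tolerance=0.5):
    """Group SSTables whose sizes are within `tolerance` of a bucket's average size."""
    buckets = []  # each entry: [average_size, [sstables...]]
    for sst in sorted(sstables, key=lambda s: s.size):
        for bucket in buckets:
            avg, members = bucket
            if (1 - tolerance) * avg <= sst.size <= (1 + tolerance) * avg:
                members.append(sst)
                bucket[0] = sum(m.size for m in members) / len(members)
                break
        else:
            buckets.append([sst.size, [sst]])
    return [members for _, members in buckets]


SST = namedtuple("SST", "name size")
files = [SST("a", 10), SST("b", 11), SST("c", 12), SST("d", 100)]
print([[s.name for s in b] for b in bucket_by_size(files)])  # [['a', 'b', 'c'], ['d']]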

Time-Window Compaction (Time-Series Optimized)

python
# Time-window compaction for time-series data
from collections import defaultdict
from datetime import timedelta

class TimeWindowCompaction:
    def __init__(self, window_size=timedelta(hours=1)):
        self.window_size = window_size

    def compact(self):
        # Group SSTables by time window
        windows = defaultdict(list)
        for sstable in self.all_sstables:
            window = self.get_time_window(sstable.min_timestamp)
            windows[window].append(sstable)

        # Compact each window separately
        for window, sstables in windows.items():
            if self.should_compact_window(window, sstables):
                self.compact_window(sstables)
Pros: Perfect for time-series data where you query recent data more often
Cons: Not suitable for random access patterns
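
The get_time_window helper above just floors a timestamp to its window boundary. One possible implementation (my own sketch, assuming Unix-second timestamps and the one-hour window from the constructor):

python
def get_time_window(timestamp, window_seconds=3600):
    """Floor a Unix timestamp to the start of its window (default: 1 hour)."""
    return timestamp - (timestamp % window_seconds)


print(get_time_window(7250), get_time_window(10799))  # 7200 7200: same hourly bucket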

Real-World LSM Trees in Action

Let's look at how different databases use LSM trees and what makes each unique:

RocksDB: The Swiss Army Knife

Facebook's RocksDB is probably the most widely embedded LSM tree implementation. It's used in everything from MySQL (MyRocks storage engine) to Kafka Streams to blockchain databases.
cpp
// Using RocksDB (C++)
#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;

// Tune for write-heavy workload
options.write_buffer_size = 64 * 1024 * 1024;    // 64MB MemTable
options.max_write_buffer_number = 3;             // 3 MemTables
options.level0_file_num_compaction_trigger = 4;  // Compact after 4 files

rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);

// Writes are fast
db->Put(rocksdb::WriteOptions(), "key1", "value1");

// Reads check multiple places
std::string value;
db->Get(rocksdb::ReadOptions(), "key1", &value);
RocksDB's Secret Weapons:
  • Column families (like separate LSM trees within one database)
  • Pluggable compaction strategies
  • Extensive tuning options (hundreds of knobs)
  • Support for transactions and snapshots

Cassandra/ScyllaDB: Distributed LSM Trees

Cassandra takes LSM trees and distributes them across multiple nodes:
java
// Cassandra's distributed LSM model (conceptual)
public class CassandraLSM {
    // Each node has its own LSM tree for its partition range
    private final Map<Range<Token>, LSMTree> localTrees;

    public void write(PartitionKey key, Row row) {
        Token token = partitioner.getToken(key);
        Node coordinator = getCoordinator(token);

        if (coordinator.equals(localNode)) {
            // Write to local LSM tree
            LSMTree tree = localTrees.get(getRangeForToken(token));
            tree.write(key, row);

            // Replicate to other nodes
            for (Node replica : getReplicaNodes(token)) {
                replica.replicateWrite(key, row);
            }
        }
    }
}
Distributed LSM Advantages:
  • Each node only compacts its own data
  • Writes are distributed across the cluster
  • Natural sharding by partition key
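
The "natural sharding" in that last bullet comes from consistent hashing: each node owns a slice of a token ring, and a key's hash decides which node's LSM tree receives the write. Here's a minimal sketch of the idea (hypothetical node names and a single token per node, not Cassandra's actual partitioner):

python
import bisect
import hashlib

class TokenRing:
    """Toy consistent-hash ring: each node owns the arc up to its token."""

    def __init__(self, nodes):
        # One token per node for simplicity; real rings use many virtual nodes.
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, partition_key):
        token = self._token(partition_key)
        tokens = [t for t, _ in self.ring]
        idx = bisect.bisect_right(tokens, token) % len(self.ring)
        return self.ring[idx][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
for key in ["user:42", "user:43", "sensor:7"]:
    print(key, "->", ring.node_for(key))  # each key consistently maps to one node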

LevelDB/Chrome: LSM Trees in Your Browser

Yes, Chrome uses an LSM tree (LevelDB) for IndexedDB:
javascript
// Your browser is running an LSM tree right now!
const request = indexedDB.open("MyDatabase", 1);

request.onsuccess = (event) => {
  const db = event.target.result;
  const transaction = db.transaction(["users"], "readwrite");
  const store = transaction.objectStore("users");

  // This write goes into LevelDB's MemTable
  store.put({ id: 1, name: "Alice", email: "alice@example.com" });

  // This read might check MemTable + multiple SSTables
  const getRequest = store.get(1);
};

LSM Trees vs B-Trees: The Showdown

Let's compare LSM trees with traditional B-trees across various dimensions:
Characteristic | LSM Tree | B-Tree | Winner
Write Performance | Sequential writes only | Random writes required | LSM Tree 🏆
Read Performance | May check multiple files | Direct path to data | B-Tree 🏆
Space Efficiency | Temporary duplication | No duplication | B-Tree 🏆
SSD Friendliness | Reduces write wear | Causes write amplification | LSM Tree 🏆
Crash Recovery | Simple (replay log) | Complex (fix tree structure) | LSM Tree 🏆
Compression | Excellent (immutable files) | Limited (in-place updates) | LSM Tree 🏆

Performance in Numbers

Here's what this means in practice:
python
# Benchmark comparison (conceptual)
class DatabaseBenchmark:
    def __init__(self):
        self.lsm_db = RocksDB()
        self.btree_db = PostgreSQL()

    def benchmark_writes(self, num_records=1_000_000):
        # LSM Tree: ~500,000 writes/second
        start = time.time()
        for i in range(num_records):
            self.lsm_db.put(f"key_{i}", f"value_{i}")
        lsm_time = time.time() - start

        # B-Tree: ~50,000 writes/second (with durability)
        start = time.time()
        for i in range(num_records):
            self.btree_db.insert("table", {"key": f"key_{i}", "value": f"value_{i}"})
        btree_time = time.time() - start

        print(f"LSM writes/sec: {num_records / lsm_time:,.0f}")
        print(f"B-tree writes/sec: {num_records / btree_time:,.0f}")

    def benchmark_reads(self, num_reads=100_000):
        # LSM Tree: ~200,000 reads/second (with bloom filters)
        # B-Tree: ~300,000 reads/second (direct index lookup)
        pass
The 10x Rule: LSM trees often provide 10x better write performance than B-trees, while B-trees typically provide 1.5x better read performance. For write-heavy workloads, this trade-off is a no-brainer.

When to Use LSM Tree Databases

LSM trees excel in specific scenarios:

✅ Perfect Use Cases

1. Time-Series Data
python
# IoT sensor data, metrics, logs
sensor_data = {
    "timestamp": 1699564800,
    "sensor_id": "temp_sensor_42",
    "temperature": 23.5,
    "humidity": 45.2
}
# Append-only pattern = LSM heaven
tsdb.write(sensor_data)
2. Write-Heavy Workloads
  • Event streaming (Kafka uses RocksDB for state stores)
  • User activity tracking
  • Log aggregation
  • Real-time analytics ingestion
3. High-Throughput Applications
java
// Message queue backed by LSM tree
public class HighThroughputQueue {
    private final RocksDB db;
    private final AtomicLong writeCounter = new AtomicLong();

    public void enqueue(Message message) {
        long id = writeCounter.incrementAndGet();
        byte[] key = ByteBuffer.allocate(8).putLong(id).array();
        db.put(key, serialize(message));
        // Can handle millions of messages per second
    }
}
4. Large Datasets with Sequential Access
  • Blockchain storage (each block appends)
  • Commit logs
  • Audit trails

❌ Poor Use Cases

1. Random Reads
sql
-- This query pattern is painful for LSM trees
SELECT * FROM users WHERE user_id = ?  -- Random key lookup
-- LSM tree might check 5-10 SSTables
-- B-tree goes directly to the leaf page
2. Frequent Updates to Existing Records
python
# LSM trees don't update in place
# This creates multiple versions
for _ in range(1000):
    db.put("hot_key", new_value)  # Creates 1000 versions!
    # All versions exist until compaction
3. Strong Consistency Requirements
  • Bank account balances (need immediate consistency)
  • Inventory systems (can't oversell)
  • Traditional OLTP workloads
4. Small Datasets
  • If your data fits in memory, LSM overhead isn't worth it
  • B-trees are simpler for small data

Tuning LSM Trees: The Art and Science

The dirty secret of LSM trees is that they require tuning to perform well. Here are the key knobs:

Write Amplification vs Read Amplification

python
class LSMTuning:
    def __init__(self):
        # Larger MemTable = fewer SSTables = better reads, worse memory usage
        self.memtable_size = 256 * MB

        # More levels (smaller fanout) = lower write amplification,
        # but worse read amplification (more files to check)
        self.num_levels = 7

        # Aggressive compaction = better reads, worse writes
        self.compaction_style = "leveled"  # or "tiered" for write-optimized

        # Bloom filter tuning
        self.bloom_bits_per_key = 10  # ~1% false positive rate
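
That last comment ("~1% false positive rate") falls straight out of the standard Bloom filter formulas; here's the arithmetic for 10 bits per key with the optimal number of hash functions:

python
import math

bits_per_key = 10                              # m / n
optimal_hashes = bits_per_key * math.log(2)    # k = (m/n) * ln 2  ≈ 6.9, round to 7
false_positive_rate = 0.5 ** optimal_hashes    # p ≈ (1/2)^k ≈ 0.6185^(m/n)

print(round(optimal_hashes), f"{false_positive_rate:.3%}")  # 7 0.819% -- roughly 1%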

Real-World Tuning Example

yaml
# Cassandra tuning for write-heavy workload
compaction:
  class: SizeTieredCompactionStrategy
  min_threshold: 4
  max_threshold: 32

memtable_flush_writers: 4
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048

# RocksDB tuning for read-heavy workload
rocksdb_options:
  level_compaction_dynamic_level_bytes: true
  max_bytes_for_level_base: 512MB
  target_file_size_base: 64MB
  block_cache_size: 8GB
  bloom_locality: 1
⚠️
The Tuning Trap: Don't over-tune! Start with defaults, measure your actual workload, then tune based on real bottlenecks. Most applications work fine with default settings.

The Future of LSM Trees

LSM trees continue to evolve:

Hybrid Approaches

Some databases are combining LSM trees with other structures:
cpp
// Conceptual hybrid: LSM tree for hot, recent data; B-tree for colder history
class HybridStorage {
    LSMTree recent_data;     // Last 24 hours
    BTree historical_data;   // Older than 24 hours

    void write(Key k, Value v) {
        recent_data.put(k, v);
        // Background job migrates entries to the B-tree after 24 hours
    }
};

Hardware Acceleration

  • Persistent Memory: Intel Optane enables larger MemTables
  • Computational Storage: Offload compaction to smart SSDs
  • GPU Acceleration: Parallel merge operations during compaction

Learned Indexes

Research into using machine learning to predict which SSTable contains a key:
python
# Instead of checking bloom filters, use an ML model
class LearnedLSM:
    def __init__(self):
        self.model = train_key_distribution_model()

    def read(self, key):
        # Model predicts which SSTable likely contains the key
        predicted_sstable = self.model.predict(key)

        # Check predicted SSTable first
        if result := predicted_sstable.get(key):
            return result

        # Fall back to regular search
        return self.regular_search(key)

Conclusion: LSM Trees Are Eating the Database World

LSM trees represent a fundamental shift in database design philosophy: instead of optimizing for the general case, optimize for the common case (writes) and make the uncommon case (reads) good enough.
💡
Key Takeaways:
  1. LSM trees trade read performance for write performance - a winning trade for many modern workloads
  2. They turn random writes into sequential writes - crucial for SSD longevity and performance
  3. Compaction strategies determine the exact trade-offs - choose based on your workload
  4. They're not a silver bullet - B-trees still win for read-heavy OLTP workloads
The explosion of LSM tree adoption isn't an accident. In a world of event streams, time-series data, log aggregation, and high-velocity data ingestion, LSM trees provide exactly what modern applications need: the ability to sustainably handle massive write throughput without breaking the bank on hardware.
So next time you're evaluating databases and see "LSM tree-based storage engine," you'll know exactly what that means: blazing fast writes, pretty good reads, and a design philosophy that embraces the realities of modern hardware and workloads.
Time to go compact some SSTables. These blog posts don't write themselves to disk, you know.

Part 1 of 2 in the WTF Series