2023-11-15 · 18 min read
Databases

WTF is an LSM Tree? The Secret Sauce Behind Modern Databases


If you've been paying attention to database technology over the past few years, you've probably noticed a pattern: every hot new database seems to be built on something called an "LSM tree." RocksDB? LSM tree. Apache Cassandra? LSM tree. ScyllaDB? LSM tree. LevelDB, HBase, InfluxDB, even parts of newer systems like CockroachDB—all LSM trees under the hood.
But here's the thing: most developers using these databases have no idea what an LSM tree actually is, how it works, or why it's suddenly everywhere. They just know these databases are "fast for writes" and "good for time-series data." It's like driving a Tesla without understanding electric motors—it works, but you're missing out on understanding why it works so differently from traditional engines.
LSM trees aren't just another data structure—they represent a fundamental rethinking of how databases should handle the eternal conflict between write performance and read performance. Understanding them will change how you think about database design forever.
Today, we're going to demystify LSM trees, understand why they've taken over the database world, and most importantly, learn when you should (and shouldn't) use databases built on them.

The Fundamental Database Dilemma

Before we dive into LSM trees, we need to understand the problem they're solving. At its core, every database faces a brutal trade-off:

The Database Triangle of Sadness

[Diagram: a triangle with Fast Writes ("append-only is fastest"), Fast Reads ("sorted data reads fastest"), and Space Efficiency ("no duplicates saves space") at its corners, with a trade-off along every edge.]

Fast Writes: The fastest way to write data is to just append it to the end of a file. No seeking, no reading, just pure sequential writes. Your disk (especially SSDs) loves this.
Fast Reads: The fastest way to read data is from a sorted, indexed structure where you can binary search or directly jump to what you need. B-trees are the classic example.
Space Efficiency: You don't want to store the same data multiple times or waste space on deleted records.
Traditional databases (think PostgreSQL, MySQL, most RDBMS) chose to optimize for reads using B-trees. This makes sense—for decades, most database workloads were read-heavy. But this comes at a cost: every write potentially needs to update multiple pages in the tree, causing random I/O.
⚠️
B-trees require random writes to maintain their sorted structure. Random I/O is dramatically slower than sequential I/O: on spinning disks the gap can be 100x or more because of seek time, and even on SSDs scattered small writes hurt throughput and trigger internal write amplification that wears the drive.
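
You can get a rough feel for this gap yourself. The sketch below is my own illustration (not from any database codebase): it appends records to one file sequentially, then writes the same records at random offsets in another. The exact ratio you measure depends heavily on your OS page cache and drive, but random writes will generally lag.

python
import os
import random
import time

N = 20_000
RECORD = b"x" * 4096  # 4 KiB per record

# Sequential: append every record to the end of one file.
start = time.time()
with open("seq.dat", "wb") as f:
    for _ in range(N):
        f.write(RECORD)
    f.flush()
    os.fsync(f.fileno())
seq = time.time() - start

# Random: write the same records at random offsets in a pre-sized file.
# Note: the page cache absorbs much of the difference in this rough test;
# per-write fsync or O_DIRECT exposes the true gap on real hardware.
start = time.time()
with open("rand.dat", "wb") as f:
    f.truncate(N * len(RECORD))
    for _ in range(N):
        f.seek(random.randrange(N) * len(RECORD))
        f.write(RECORD)
    f.flush()
    os.fsync(f.fileno())
rand = time.time() - start

print(f"sequential: {seq:.2f}s   random: {rand:.2f}s   ratio: {rand / seq:.1f}x")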

Enter the LSM Tree: Write-Optimized by Design

LSM stands for "Log-Structured Merge" tree, and the name tells you everything about the philosophy: treat your data like a log (sequential, append-only) and periodically merge it to keep things organized.
Here's the brilliant insight: what if we completely separated the write path from the read path?

[Diagram: new writes land in an in-memory MemTable; when full, it is flushed as an immutable SSTable at level L0; background compaction merges SSTables down through levels L1, L2, and L3; a read request consults the MemTable and then the SSTables along the read path.]

The genius of LSM trees is that they turn the database write problem into a series of sequential writes:
  1. All writes go to memory first (the MemTable)
  2. When memory fills up, flush to disk as an immutable sorted file (SSTable)
  3. Periodically merge and compact these files in the background
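To make that flow concrete, here's a toy LSM store in a few dozen lines of Python. This is a sketch of the idea, not how any real engine is written (the name ToyLSM and the in-memory "SSTables" are my own simplifications): writes go to an in-memory table, full memtables are frozen into sorted immutable runs, and reads check newest-to-oldest.

python
class ToyLSM:
    """Minimal LSM sketch: memtable + list of immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}           # recent writes, in memory
        self.sstables = []           # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # no on-disk structure to update
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Freeze the memtable into an immutable sorted run ("SSTable").
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)  # newest runs are consulted first
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # 1. check the memtable
            return self.memtable[key]
        for run in self.sstables:     # 2. check runs, newest to oldest
            lo, hi = 0, len(run) - 1
            while lo <= hi:           # binary search within the sorted run
                mid = (lo + hi) // 2
                if run[mid][0] == key:
                    return run[mid][1]
                if run[mid][0] < key:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return None


db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
print(db.get("k3"))   # found in an older run
print(db.get("k9"))   # still in the memtable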
Let's break down each component:
  • MemTable: an in-memory write buffer that keeps recent writes in a sorted data structure (usually a skip list). All writes land here first.
  • SSTable (Sorted String Table): an immutable, sorted file format that stores key-value pairs on disk. Once written, an SSTable is never modified.
  • Bloom filter: a probabilistic data structure that can quickly tell whether a key might be in an SSTable. It can return false positives but never false negatives.

The MemTable: Your Write Buffer

cpp
// Simplified MemTable structure (conceptual)
class MemTable {
private:
    std::map<Key, Value> data;     // Often a skip list in practice
    std::atomic<size_t> size{0};

public:
    void put(const Key& key, const Value& value) {
        data[key] = value;
        size += key.size() + value.size();

        if (size > threshold) {
            flushToDisk();  // Trigger SSTable creation
        }
    }

    std::optional<Value> get(const Key& key) {
        auto it = data.find(key);
        return it != data.end() ? std::optional<Value>(it->second)
                                : std::nullopt;
    }
};
The MemTable is just an in-memory data structure (usually a skip list or red-black tree) that keeps recent writes sorted by key. It's blazing fast because:
  • All operations are in-memory
  • No disk I/O on the write path
  • Sorted structure enables fast lookups

SSTables: Immutable Sorted Files

When the MemTable fills up, it's flushed to disk as an SSTable (Sorted String Table). These are immutable files with a simple structure:

SSTable Structure

  • Data blocks: the sorted key-value pairs themselves (key1:value1, key2:value2, ...)
  • Index block: a sparse index mapping keys to block offsets (key1:offset1, key100:offset2, ...)
  • Metadata: min_key, max_key, timestamp, size
  • Bloom filter: probabilistic membership test for fast negative lookups

python
# Conceptual SSTable structure
class SSTable:
    def __init__(self, data):
        self.data_blocks = self._create_data_blocks(data)
        self.index = self._build_index()
        self.bloom_filter = self._build_bloom_filter()
        self.metadata = {
            'min_key': min(data.keys()),
            'max_key': max(data.keys()),
            'num_entries': len(data),
            'timestamp': time.time()
        }

    def get(self, key):
        # Quick rejection using bloom filter
        if not self.bloom_filter.might_contain(key):
            return None

        # Quick rejection using metadata
        if key < self.metadata['min_key'] or key > self.metadata['max_key']:
            return None

        # Binary search in index to find block
        block_offset = self._binary_search_index(key)

        # Read only the relevant block
        block = self._read_block(block_offset)
        return block.get(key)
The beauty of SSTables:
  • Immutable: Once written, never modified (simplifies concurrency)
  • Sorted: Enables efficient range queries and merging
  • Self-contained: Each file has its own index and metadata
  • Bloom filters: Probabilistic data structure to quickly check if a key might exist
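
Bloom filters deserve a closer look, because they're what keep reads cheap despite the many files. Here's a minimal sketch of one; this is my own toy version using Python's built-in hash with salts, not the double-hashing scheme real engines use.

python
class ToyBloomFilter:
    """Bit array + k salted hashes. False positives possible, false negatives never."""

    def __init__(self, num_bits=1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # use a big int as the bit array

    def _positions(self, key):
        # Derive k bit positions by salting the key; real implementations
        # typically combine two independent hashes instead.
        for salt in range(self.num_hashes):
            yield hash((salt, key)) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))


bf = ToyBloomFilter()
bf.add("apple")
print(bf.might_contain("apple"))    # True (always: no false negatives)
print(bf.might_contain("banana"))   # almost certainly False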

The Compaction Process: Keeping Things Tidy

Here's where things get interesting. As you keep writing, you'll accumulate many SSTables. This creates two problems:
  1. Read amplification: Reads might need to check many files
  2. Space amplification: The same key might exist in multiple files (with different versions)
The solution is compaction—periodically merging SSTables together:
  • Read amplification: the overhead of reading data from multiple locations to serve a single query. In LSM trees, a key might exist in multiple SSTables, so one read may have to check several files.
  • Write amplification: the ratio of bytes actually written to disk versus the logical bytes written by the application. In LSM trees, data is rewritten each time compaction moves it to a lower level.
  • Space amplification: the ratio of actual disk space used to the logical data size. In LSM trees, the same key may exist in multiple SSTables until compaction removes the obsolete versions.
[Sequence diagram: the app writes k1:v1, k2:v2, then an update k1:v3; the MemTable is flushed to Level 0, and successive compactions push data into larger, fewer files at Levels 1 and 2, keeping k1:v3 (latest) and k2:v2 while the obsolete k1:v1 is dropped.]

Compaction Strategies: The Secret Sauce

Different databases use different compaction strategies, and this is where LSM trees get really interesting:
[Interactive demo: a MemTable (fig, grape, apple) plus two overlapping Level 0 SSTables (key ranges [apple - cherry] and [apple - elderberry]) and an older, sorted Level 1 file (range [apple - date]) are compacted together; of 13 stored key versions only 7 unique keys survive, with the newest timestamp winning for each key.]
Key Concepts:
  • Newer data (higher timestamps) overwrites older versions of the same key
  • Compaction merges overlapping SSTables and removes obsolete versions
  • Level 0 contains unsorted, overlapping files from MemTable flushes
  • Lower levels have sorted, non-overlapping key ranges for efficient reads
Performance Insights:

Write Amplification: Each key is rewritten every time compaction moves it to a lower level, so one logical write can turn into several physical writes over its lifetime.

Space Amplification: Before compaction, the same data existed in multiple versions; merging down to one version per key reclaims that space (13 stored versions became 7 unique keys in the example above).

Read Amplification: Without compaction, a read would need to check all 3 overlapping files. After compaction, a Level 1 query needs to check just 1 file.
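
Whatever the strategy, the core primitive is the same: a k-way merge of sorted runs that keeps only the newest version of each key. Here's a minimal sketch, assuming each run is a list of (key, timestamp, value) tuples sorted by key (the function and tuple layout are my own, not any engine's format).

python
import heapq

def merge_runs(runs):
    """Merge sorted runs, keeping only the newest version of each key."""
    merged = []
    # Order equal keys by descending timestamp so the newest version comes first.
    stream = heapq.merge(*runs, key=lambda e: (e[0], -e[1]))
    last_key = object()  # sentinel that never equals a real key
    for key, ts, value in stream:
        if key != last_key:
            merged.append((key, ts, value))
            last_key = key
        # Older versions of the same key are skipped (a real engine would
        # also drop tombstones for deleted keys here).
    return merged


l0_a = [("apple", 200, "green"), ("date", 201, "brown")]
l0_b = [("apple", 100, "red"), ("banana", 101, "yellow")]
print(merge_runs([l0_a, l0_b]))
# [('apple', 200, 'green'), ('banana', 101, 'yellow'), ('date', 201, 'brown')]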

Leveled Compaction (RocksDB, LevelDB)

python
# Simplified leveled compaction logic
class LeveledCompaction:
    def __init__(self):
        self.levels = [[] for _ in range(7)]  # L0 to L6
        self.level_max_bytes = [
            10 * MB,    # L0: 10MB
            100 * MB,   # L1: 100MB
            1 * GB,     # L2: 1GB
            10 * GB,    # L3: 10GB
            100 * GB,   # L4: 100GB
            1 * TB,     # L5: 1TB
            10 * TB     # L6: 10TB
        ]

    def compact(self, level):
        if self.get_level_size(level) > self.level_max_bytes[level]:
            # Pick overlapping SSTables from level and level+1
            files_to_compact = self.pick_compaction_files(level)

            # Merge sort all entries, removing duplicates
            merged_data = self.merge_sort_files(files_to_compact)

            # Write new SSTables to level+1
            new_sstables = self.write_sstables(merged_data, level + 1)

            # Delete old files
            self.delete_files(files_to_compact)
Pros: Minimizes read amplification (each level has non-overlapping key ranges)
Cons: Higher write amplification (data gets rewritten multiple times)
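
How much higher? Here's a rough back-of-envelope estimate of my own, assuming the 10x size ratio between levels shown above: every byte that migrates to the next level drags along roughly a level's worth of overlapping data, so write amplification grows with both the fanout and the number of level transitions.

python
# Back-of-envelope write amplification for leveled compaction (rough upper bound).
# Assumptions: size ratio (fanout) of 10 between adjacent levels, and data that
# eventually migrates across 6 level transitions (L0 -> L6). Moving a byte into
# level N+1 rewrites roughly `fanout` bytes of level N+1 per byte coming from N.
fanout = 10
level_transitions = 6

write_amplification = 1 + fanout * level_transitions  # +1 for the initial flush
print(write_amplification)  # ~61: each logical byte may be rewritten tens of times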

Size-Tiered Compaction (Cassandra, ScyllaDB)

python
# Simplified size-tiered compaction
class SizeTieredCompaction:
    def __init__(self):
        self.min_threshold = 4   # Minimum files of similar size
        self.max_threshold = 32  # Maximum files to compact at once

    def find_compaction_candidates(self, sstables):
        # Group SSTables by size (within 50% of each other)
        buckets = self.bucket_by_size(sstables)

        for bucket in buckets:
            if self.min_threshold <= len(bucket) <= self.max_threshold:
                return bucket[:self.max_threshold]

        return None

    def compact(self):
        candidates = self.find_compaction_candidates(self.all_sstables)
        if candidates:
            merged = self.merge_sstables(candidates)
            self.write_new_sstable(merged)
            self.delete_old_sstables(candidates)
Pros: Lower write amplification
Cons: Higher read amplification, temporary space usage
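
The bucket_by_size helper above does the real work. Here's my own sketch of one possible implementation, grouping files whose sizes fall within 50% of a bucket's running average (the SST namedtuple below is purely for demonstration):

python
from collections import namedtuple

def bucket_by_size(sstables, tolerance=0.5):
    """Group SSTables whose sizes are within `tolerance` of a bucket's average size."""
    buckets = []  # each entry: [average_size, [sstables...]]
    for sst in sorted(sstables, key=lambda s: s.size):
        for bucket in buckets:
            avg, members = bucket
            if (1 - tolerance) * avg <= sst.size <= (1 + tolerance) * avg:
                members.append(sst)
                bucket[0] = sum(m.size for m in members) / len(members)
                break
        else:
            buckets.append([sst.size, [sst]])
    return [members for _, members in buckets]


SST = namedtuple("SST", "name size")
files = [SST("a", 10), SST("b", 11), SST("c", 12), SST("d", 100)]
print([[s.name for s in b] for b in bucket_by_size(files)])  # [['a', 'b', 'c'], ['d']]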

Time-Window Compaction (Time-Series Optimized)

python
# Time-window compaction for time-series data
from collections import defaultdict
from datetime import timedelta

class TimeWindowCompaction:
    def __init__(self, window_size=timedelta(hours=1)):
        self.window_size = window_size

    def compact(self):
        # Group SSTables by time window
        windows = defaultdict(list)
        for sstable in self.all_sstables:
            window = self.get_time_window(sstable.min_timestamp)
            windows[window].append(sstable)

        # Compact each window separately
        for window, sstables in windows.items():
            if self.should_compact_window(window, sstables):
                self.compact_window(sstables)
Pros: Perfect for time-series data where you query recent data more often
Cons: Not suitable for random access patterns
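
The get_time_window helper above just floors a timestamp to its window boundary. One possible implementation (my own sketch, assuming Unix-second timestamps and the one-hour window from the constructor):

python
def get_time_window(timestamp, window_seconds=3600):
    """Floor a Unix timestamp to the start of its window (default: 1 hour)."""
    return timestamp - (timestamp % window_seconds)


print(get_time_window(7250), get_time_window(10799))  # 7200 7200: same hourly bucket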

Real-World LSM Trees in Action

Let's look at how different databases use LSM trees and what makes each unique:

RocksDB: The Swiss Army Knife

Facebook's RocksDB is probably the most widely embedded LSM tree implementation. It's used in everything from MySQL (MyRocks storage engine) to Kafka Streams to blockchain databases.
cpp
// Using RocksDB (C++)
#include <rocksdb/db.h>

rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;

// Tune for write-heavy workload
options.write_buffer_size = 64 * 1024 * 1024;    // 64MB MemTable
options.max_write_buffer_number = 3;             // 3 MemTables
options.level0_file_num_compaction_trigger = 4;  // Compact after 4 files

rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);

// Writes are fast
db->Put(rocksdb::WriteOptions(), "key1", "value1");

// Reads check multiple places
std::string value;
db->Get(rocksdb::ReadOptions(), "key1", &value);
RocksDB's Secret Weapons:
  • Column families (like separate LSM trees within one database)
  • Pluggable compaction strategies
  • Extensive tuning options (hundreds of knobs)
  • Support for transactions and snapshots

Cassandra/ScyllaDB: Distributed LSM Trees

Cassandra takes LSM trees and distributes them across multiple nodes:
java
// Cassandra's distributed LSM model (conceptual)
public class CassandraLSM {
    // Each node has its own LSM tree for its partition range
    private final Map<Range<Token>, LSMTree> localTrees;

    public void write(PartitionKey key, Row row) {
        Token token = partitioner.getToken(key);
        Node coordinator = getCoordinator(token);

        if (coordinator.equals(localNode)) {
            // Write to local LSM tree
            LSMTree tree = localTrees.get(getRangeForToken(token));
            tree.write(key, row);

            // Replicate to other nodes
            for (Node replica : getReplicaNodes(token)) {
                replica.replicateWrite(key, row);
            }
        }
    }
}
Distributed LSM Advantages:
  • Each node only compacts its own data
  • Writes are distributed across the cluster
  • Natural sharding by partition key
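
The "natural sharding" in that last bullet comes from consistent hashing: each node owns a slice of a token ring, and a key's hash decides which node's LSM tree receives the write. Here's a minimal sketch of the idea (hypothetical node names and a single token per node, not Cassandra's actual partitioner):

python
import bisect
import hashlib

class TokenRing:
    """Toy consistent-hash ring: each node owns the arc up to its token."""

    def __init__(self, nodes):
        # One token per node for simplicity; real rings use many virtual nodes.
        self.ring = sorted((self._token(n), n) for n in nodes)

    @staticmethod
    def _token(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, partition_key):
        token = self._token(partition_key)
        tokens = [t for t, _ in self.ring]
        idx = bisect.bisect_right(tokens, token) % len(self.ring)
        return self.ring[idx][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
for key in ["user:42", "user:43", "sensor:7"]:
    print(key, "->", ring.node_for(key))  # each key consistently maps to one node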

LevelDB/Chrome: LSM Trees in Your Browser

Yes, Chrome uses an LSM tree (LevelDB) for IndexedDB:
javascript
// Your browser is running an LSM tree right now!
const request = indexedDB.open("MyDatabase", 1);

request.onsuccess = (event) => {
  const db = event.target.result;
  const transaction = db.transaction(["users"], "readwrite");
  const store = transaction.objectStore("users");

  // This write goes into LevelDB's MemTable
  store.put({ id: 1, name: "Alice", email: "alice@example.com" });

  // This read might check MemTable + multiple SSTables
  const getRequest = store.get(1);
};

LSM Trees vs B-Trees: The Showdown

Let's compare LSM trees with traditional B-trees across various dimensions:
Characteristic | LSM Tree | B-Tree | Winner
Write Performance | Sequential writes only | Random writes required | LSM Tree 🏆
Read Performance | May check multiple files | Direct path to data | B-Tree 🏆
Space Efficiency | Temporary duplication | No duplication | B-Tree 🏆
SSD Friendliness | Reduces write wear | Causes write amplification | LSM Tree 🏆
Crash Recovery | Simple (replay log) | Complex (fix tree structure) | LSM Tree 🏆
Compression | Excellent (immutable files) | Limited (in-place updates) | LSM Tree 🏆

Performance in Numbers

Here's what this means in practice:
python
# Benchmark comparison (conceptual)
class DatabaseBenchmark:
    def __init__(self):
        self.lsm_db = RocksDB()
        self.btree_db = PostgreSQL()

    def benchmark_writes(self, num_records=1_000_000):
        # LSM Tree: ~500,000 writes/second
        start = time.time()
        for i in range(num_records):
            self.lsm_db.put(f"key_{i}", f"value_{i}")
        lsm_time = time.time() - start

        # B-Tree: ~50,000 writes/second (with durability)
        start = time.time()
        for i in range(num_records):
            self.btree_db.insert("table", {"key": f"key_{i}", "value": f"value_{i}"})
        btree_time = time.time() - start

        print(f"LSM writes/sec: {num_records / lsm_time:,.0f}")
        print(f"B-tree writes/sec: {num_records / btree_time:,.0f}")

    def benchmark_reads(self, num_reads=100_000):
        # LSM Tree: ~200,000 reads/second (with bloom filters)
        # B-Tree: ~300,000 reads/second (direct index lookup)
        pass
The 10x Rule: LSM trees often provide 10x better write performance than B-trees, while B-trees typically provide 1.5x better read performance. For write-heavy workloads, this trade-off is a no-brainer.

When to Use LSM Tree Databases

LSM trees excel in specific scenarios:

✅ Perfect Use Cases

1. Time-Series Data
python
# IoT sensor data, metrics, logs
sensor_data = {
    "timestamp": 1699564800,
    "sensor_id": "temp_sensor_42",
    "temperature": 23.5,
    "humidity": 45.2
}
# Append-only pattern = LSM heaven
tsdb.write(sensor_data)
2. Write-Heavy Workloads
  • Event streaming (Kafka uses RocksDB for state stores)
  • User activity tracking
  • Log aggregation
  • Real-time analytics ingestion
3. High-Throughput Applications
java
// Message queue backed by LSM tree
public class HighThroughputQueue {
    private final RocksDB db;
    private final AtomicLong writeCounter = new AtomicLong();

    public void enqueue(Message message) {
        long id = writeCounter.incrementAndGet();
        byte[] key = ByteBuffer.allocate(8).putLong(id).array();
        db.put(key, serialize(message));
        // Can handle millions of messages per second
    }
}
4. Large Datasets with Sequential Access
  • Blockchain storage (each block appends)
  • Commit logs
  • Audit trails

❌ Poor Use Cases

1. Random Reads
sql
-- This query pattern is painful for LSM trees
SELECT * FROM users WHERE user_id = ?  -- Random key lookup
-- LSM tree might check 5-10 SSTables
-- B-tree goes directly to the leaf page
2. Frequent Updates to Existing Records
python
# LSM trees don't update in place
# This creates multiple versions
for _ in range(1000):
    db.put("hot_key", new_value)  # Creates 1000 versions!
    # All versions exist until compaction
3. Strong Consistency Requirements
  • Bank account balances (need immediate consistency)
  • Inventory systems (can't oversell)
  • Traditional OLTP workloads
4. Small Datasets
  • If your data fits in memory, LSM overhead isn't worth it
  • B-trees are simpler for small data

Tuning LSM Trees: The Art and Science

The dirty secret of LSM trees is that they require tuning to perform well. Here are the key knobs:

Write Amplification vs Read Amplification

python
class LSMTuning:
    def __init__(self):
        # Larger MemTable = fewer SSTables = better reads, worse memory usage
        self.memtable_size = 256 * MB

        # More levels (smaller fanout) = lower write amplification,
        # but worse read amplification (more files to check)
        self.num_levels = 7

        # Aggressive compaction = better reads, worse writes
        self.compaction_style = "leveled"  # or "tiered" for write-optimized

        # Bloom filter tuning
        self.bloom_bits_per_key = 10  # ~1% false positive rate
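
That last comment ("~1% false positive rate") falls straight out of the standard Bloom filter formulas; here's the arithmetic for 10 bits per key with the optimal number of hash functions:

python
import math

bits_per_key = 10                              # m / n
optimal_hashes = bits_per_key * math.log(2)    # k = (m/n) * ln 2  ≈ 6.9, round to 7
false_positive_rate = 0.5 ** optimal_hashes    # p ≈ (1/2)^k ≈ 0.6185^(m/n)

print(round(optimal_hashes), f"{false_positive_rate:.3%}")  # 7 0.819% -- roughly 1%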

Real-World Tuning Example

yaml
# Cassandra tuning for write-heavy workload
compaction:
  class: SizeTieredCompactionStrategy
  min_threshold: 4
  max_threshold: 32

memtable_flush_writers: 4
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048

# RocksDB tuning for read-heavy workload
rocksdb_options:
  level_compaction_dynamic_level_bytes: true
  max_bytes_for_level_base: 512MB
  target_file_size_base: 64MB
  block_cache_size: 8GB
  bloom_locality: 1
⚠️
The Tuning Trap: Don't over-tune! Start with defaults, measure your actual workload, then tune based on real bottlenecks. Most applications work fine with default settings.

The Future of LSM Trees

LSM trees continue to evolve:

Hybrid Approaches

Some databases are combining LSM trees with other structures:
cpp
// Conceptual hybrid: LSM tree for hot, recent data; B-tree for colder history
class HybridStorage {
    LSMTree recent_data;     // Last 24 hours
    BTree historical_data;   // Older than 24 hours

    void write(Key k, Value v) {
        recent_data.put(k, v);
        // Background job migrates entries to the B-tree after 24 hours
    }
};

Hardware Acceleration

  • Persistent Memory: Intel Optane enables larger MemTables
  • Computational Storage: Offload compaction to smart SSDs
  • GPU Acceleration: Parallel merge operations during compaction

Learned Indexes

Research into using machine learning to predict which SSTable contains a key:
python
# Instead of checking bloom filters, use an ML model
class LearnedLSM:
    def __init__(self):
        self.model = train_key_distribution_model()

    def read(self, key):
        # Model predicts which SSTable likely contains the key
        predicted_sstable = self.model.predict(key)

        # Check predicted SSTable first
        if result := predicted_sstable.get(key):
            return result

        # Fall back to regular search
        return self.regular_search(key)

Conclusion: LSM Trees Are Eating the Database World

LSM trees represent a fundamental shift in database design philosophy: instead of optimizing for the general case, optimize for the common case (writes) and make the uncommon case (reads) good enough.
💡
Key Takeaways:
  1. LSM trees trade read performance for write performance - a winning trade for many modern workloads
  2. They turn random writes into sequential writes - crucial for SSD longevity and performance
  3. Compaction strategies determine the exact trade-offs - choose based on your workload
  4. They're not a silver bullet - B-trees still win for read-heavy OLTP workloads
The explosion of LSM tree adoption isn't an accident. In a world of event streams, time-series data, log aggregation, and high-velocity data ingestion, LSM trees provide exactly what modern applications need: the ability to sustainably handle massive write throughput without breaking the bank on hardware.
So next time you're evaluating databases and see "LSM tree-based storage engine," you'll know exactly what that means: blazing fast writes, pretty good reads, and a design philosophy that embraces the realities of modern hardware and workloads.
Time to go compact some SSTables. These blog posts don't write themselves to disk, you know.

Part 1 of 2 in the WTF Series