2023-07-15 · 22 min read
Performance Engineering

LuaJIT: The One-Man Compiler That Embarrassed the Industry


Imagine a scripting language that's 50-100x faster than Python, often within 2x of C performance, with a foreign function interface so elegant it makes ctypes look like cave paintings. Now imagine this was all created by one person, working alone, who then disappeared—leaving behind a codebase so sophisticated that some of the world's best compiler engineers struggle to understand it.
This is the story of LuaJIT and its creator Mike Pall, a tale that exposes uncomfortable truths about software engineering, the limits of corporate development, and what happens when genius-level work becomes too clever for its own good.
LuaJIT isn't just fast—it pioneered compiler techniques that wouldn't appear in mainstream JITs for years. Its trace compiler was doing optimizations in 2009 that V8 only started implementing in 2017. Yet today, it's slowly dying because no one can maintain it.

The Shocking Numbers That Started Everything

Let me show you why LuaJIT matters with some benchmarks that still make people do a double-take:
```lua
-- Naive Fibonacci in different languages
function fib(n)
  if n < 2 then return n end
  return fib(n-1) + fib(n-2)
end

-- Benchmark results for fib(40):
-- CPython 3.9:    ~32 seconds
-- Ruby 3.0:       ~19 seconds
-- PHP 8.0:        ~11 seconds
-- JavaScript V8:  ~1.2 seconds
-- Lua 5.4:        ~15 seconds
-- LuaJIT 2.1:     ~0.8 seconds
-- C (gcc -O2):    ~0.5 seconds
```
LuaJIT is beating V8—Google's JavaScript engine with a massive team and billions in funding—with a JIT compiler written by one person. But raw recursion isn't even where LuaJIT shines brightest:
```lua
-- Numeric computation benchmark
local function mandelbrot(N)
  local width, height, limit2 = N, N, 4.0
  local iter = 50
  local bits, bit = 0, 128

  for y = 0, height - 1 do
    for x = 0, width - 1 do
      local Zr, Zi, Cr, Ci = 0.0, 0.0, 0.0, 0.0
      Cr = 2.0 * x / width - 1.5
      Ci = 2.0 * y / height - 1.0

      local i = iter
      repeat
        local Tr = Zr * Zr - Zi * Zi + Cr
        local Ti = 2.0 * Zr * Zi + Ci
        Zr, Zi = Tr, Ti
        i = i - 1
      until (Zr * Zr + Zi * Zi > limit2) or (i == 0)

      if i == 0 then bits = bits + bit end
      if bit == 1 then
        io.write(string.char(bits))
        bits, bit = 0, 128
      else
        bit = bit / 2
      end
    end
  end
end

-- Performance (smaller is better):
-- C (gcc -O3):    0.146s
-- LuaJIT:         0.172s (only 18% slower!)
-- Java:           0.291s
-- JavaScript V8:  0.385s
-- Lua 5.4:        8.937s (52x slower than LuaJIT!)
```
⚠️
These aren't cherry-picked benchmarks. LuaJIT consistently delivers near-C performance for numeric code, often beating Java and leaving other dynamic languages far behind. On the Computer Language Benchmarks Game, LuaJIT routinely topped the scripting-language results by an embarrassing margin.

The Man Behind the Magic: Mike Pall

Mike Pall is the John Carmack of compiler engineering—a programmer so far ahead of the curve that his work seems like magic. But unlike Carmack, who became a celebrity, Pall remained obscure, communicating mainly through mailing list posts that read like doctoral dissertations.
His background was in assembly language and low-level optimization. Before LuaJIT, he'd contributed to various open source projects, always focused on performance. When he discovered Lua in 2005, he saw an opportunity: Lua was small, clean, and had clear semantics—perfect for a JIT compiler.
What happened next was unprecedented. Working entirely alone, without corporate backing, Pall created:
  1. LuaJIT 1.x (2005-2012): A straightforward JIT that was already faster than most alternatives
  2. LuaJIT 2.0 (2009-2017): A complete rewrite using trace compilation, achieving near-C performance
  3. The FFI (2011): A foreign function interface so elegant it redefined what was possible
Then, around 2017, he largely disappeared. The git commits slowed, then stopped. The mailing list went quiet. LuaJIT development essentially froze.

Trace Compilation: The Secret Weapon

To understand why LuaJIT is so fast, you need to understand trace compilation—a technique so powerful that it seems like cheating.
Traditional JIT compilers work on methods:

Source Code → Parse to Bytecode → Identify Hot Methods → Compile Entire Method → Native Code

But trace compilers work on execution paths:

Interpreter Running → Hot Loop Detected? — if No, keep interpreting; if Yes: Start Recording → Record Actual Execution Path → Compile Just That Path → Super-Optimized Native Code

Here's why this is genius:
```lua
-- Consider this code:
function process_data(items)
  local sum = 0
  for i = 1, #items do
    if items[i] > 0 then -- Branch
      if items[i] < 100 then -- Another branch
        sum = sum + items[i]
      else
        sum = sum + 100
      end
    end
  end
  return sum
end

-- Method JIT must compile all possible paths
-- Trace JIT only compiles the path actually taken!
```
When LuaJIT detects a hot loop, it doesn't compile the function—it records exactly what happens:
```
-- If your data is usually positive numbers under 100,
-- LuaJIT records THIS trace:
TRACE 1 start process_data:2
0001  GGET     2   0      ; "items"
0002  LEN      3   2
0003  LOOP     4 => 0013
0004  GGET     5   0      ; "items"
0005  TGET     6   5   4  ; items[i]
0006  KSHORT   7   0      ; 0
0007  ISGT     6   7      ; Guard: items[i] > 0
0008  KSHORT   8   100    ; 100
0009  ISLT     6   8      ; Guard: items[i] < 100
0010  ADD      1   1   6  ; sum = sum + items[i]
0011  ADDVN    4   4   0  ; i = i + 1
0012  JMP      4 => 0003
0013  TRACE 1 stop -> loop
```
The Guards are the magic—they check if we're still on the recorded path. If a guard fails, we fall back to the interpreter and maybe record a new trace.

The FFI: When Zero Cost Actually Means Zero

LuaJIT's Foreign Function Interface is perhaps its most revolutionary feature. While Python's ctypes makes you want to cry, LuaJIT's FFI is so simple it seems like it shouldn't work:
```lua
-- This is ALL you need to call C functions:
local ffi = require("ffi")
ffi.cdef[[
  typedef struct { double x, y; } point_t;
  double sqrt(double x);
  double atan2(double y, double x);
]]

-- Now just use it!
local function distance(p1, p2)
  local dx = p2.x - p1.x
  local dy = p2.y - p1.y
  return ffi.C.sqrt(dx*dx + dy*dy) -- Calls C's sqrt directly!
end

-- Creating C structs is trivial
local p1 = ffi.new("point_t", {x = 10, y = 20})
local p2 = ffi.new("point_t", {x = 30, y = 40})
print(distance(p1, p2)) -- 28.284271247462
```
But here's the kicker—this has ZERO overhead. The JIT compiler inlines everything:
```lua
-- This Lua code:
local function add_arrays(a, b, c, n)
  for i = 0, n-1 do
    c[i] = a[i] + b[i]
  end
end

-- Compiles to THIS machine code:
--   movsd xmm0, [rsi+rax*8] ; Load a[i]
--   addsd xmm0, [rdx+rax*8] ; Add b[i]
--   movsd [rcx+rax*8], xmm0 ; Store to c[i]
--   inc   rax               ; i++
--   cmp   rax, rdi          ; i < n?
--   jl    loop              ; Loop if true

-- That's literally what a C compiler would generate!
```

Real-World Domination

LuaJIT's performance made it the choice for performance-critical applications everywhere:

1. OpenResty: Nginx on Steroids

OpenResty embeds LuaJIT into Nginx, creating a web server that can handle application logic at wire speed:
```lua
-- This handles 1M+ requests/second:
local redis = require "resty.redis"
local cjson = require "cjson"

local red = redis:new()
local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
  ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
end

local user_id = ngx.var.arg_id
local user_data = red:get("user:" .. user_id)

-- lua-resty-redis returns ngx.null for a missing key
if user_data and user_data ~= ngx.null then
  ngx.header.content_type = "application/json"
  ngx.say(user_data)
else
  ngx.status = 404
  ngx.say(cjson.encode({error = "User not found"}))
end
```
Cloudflare, Alibaba, and Tumblr all run OpenResty at massive scale. Cloudflare handles 25 million HTTP requests per second using LuaJIT for edge computing.

2. Game Development: From Angry Birds to World of Warcraft

Every World of Warcraft addon runs on Lua, but games using LuaJIT saw 10-50x performance improvements:
```lua
-- LÖVE2D game engine with LuaJIT
function love.update(dt)
  -- This particle system can handle 100K+ particles at 60 FPS
  -- (iterate backwards so table.remove doesn't skip elements)
  for i = #particles, 1, -1 do
    local p = particles[i]
    p.vy = p.vy + gravity * dt
    p.x = p.x + p.vx * dt
    p.y = p.y + p.vy * dt
    p.life = p.life - dt

    if p.life <= 0 then
      table.remove(particles, i)
    end
  end
end
```

3. Scientific Computing: When Python is Too Slow

Scientists discovered LuaJIT could replace Python+NumPy for many tasks:
```lua
-- SciLua: Scientific computing in LuaJIT
local ffi = require("ffi")
local C = ffi.C

ffi.cdef[[
  void dgemm_(const char* transa, const char* transb, int* m, int* n, int* k,
              double* alpha, double* a, int* lda, double* b, int* ldb,
              double* beta, double* c, int* ldc);
]]

-- Direct BLAS calls with zero overhead!
-- (Fortran BLAS takes every argument by pointer, so scalars are boxed)
local int = ffi.typeof("int[1]")
local function matrix_multiply(A, B, Cm, m, n, k)
  C.dgemm_("N", "N", int(m), int(n), int(k),
           ffi.new("double[1]", 1.0), A, int(m), B, int(k),
           ffi.new("double[1]", 0.0), Cm, int(m))
end
```

The Architecture: How Mike Pall Did It

LuaJIT's architecture is a masterclass in compiler design. Here's the high-level view:

  • Frontend: Lua Source → Bytecode Compiler → Optimized Bytecode
  • Interpreter: the fast interpreter executes the bytecode; when a loop gets hot, control passes to the trace compiler (otherwise it just keeps interpreting)
  • Trace Compiler: Trace Recorder → SSA IR Builder → Optimization Passes → Register Allocator → Machine Code Gen
  • Runtime: Code Cache → Native Execution, which stays in machine code while every guard holds and drops back to the interpreter when one fails
1. The Bytecode: Designed for Speed

LuaJIT uses its own compact register-based bytecode (redesigned from scratch rather than reusing PUC Lua's format), with each instruction carefully designed to map efficiently to machine code:
```
-- Lua source:
local x = a + b * c

-- LuaJIT bytecode:
GGET  0  "a"      ; Load global 'a' into register 0
GGET  1  "b"      ; Load global 'b' into register 1
GGET  2  "c"      ; Load global 'c' into register 2
MUL   1  1  2     ; r1 = r1 * r2
ADD   0  0  1     ; r0 = r0 + r1
```

2. The Trace Recorder: Watching Your Code Run

When the interpreter detects a hot loop, it switches to recording mode:
```c
// Simplified trace recording logic
void record_trace() {
  while (recording) {
    BCIns ins = *pc++; // Get next bytecode

    switch (bc_op(ins)) {
    case BC_ADD:
      emit_ir(IR_ADD, bc_a(ins), bc_b(ins), bc_c(ins));
      break;
    case BC_LOOP:
      if (++loop_count > HOTLOOP_THRESHOLD) {
        end_trace();
        compile_trace();
      }
      break;
    // ... guards for type checks, bounds checks, etc.
    }
  }
}
```

3. The IR: Static Single Assignment Form

LuaJIT converts bytecode to an SSA-based intermediate representation:
```
-- For this Lua code:
local sum = 0
for i = 1, n do
    sum = sum + arr[i]
end

-- LuaJIT generates this IR:
0001 >  int SLOAD  #2    CI  ; n
0002 >  int LE     0001  +2147483646
0003    int SLOAD  #1    CI  ; i
0004    p32 AREF   P[0x400] 0003
0005 >  num ALOAD  0004
0006    num SLOAD  #3        ; sum
0007 +  num ADD    0006  0005
0008 +  int ADD    0003  +1
0009 >  int LE     0008  0001
0010 ------ LOOP ------------
0011 >  int PHI    0003  0008  ; Loop variable
0012 >  num PHI    0006  0007  ; Accumulator
```

4. The Optimizations: Where the Magic Happens

LuaJIT applies sophisticated optimizations that rival commercial compilers. Let's dive deep into each one to understand why they're so powerful:

Allocation Sinking: Objects That Never Exist

This is perhaps LuaJIT's most impressive trick. Consider this code:
```lua
-- Normal Lua creates temporary tables here
function vector_add(x1, y1, x2, y2)
  local v1 = {x = x1, y = y1} -- Allocation?
  local v2 = {x = x2, y = y2} -- Another allocation?
  return v1.x + v2.x, v1.y + v2.y
end

-- LuaJIT's trace compiler sees through this!
-- After optimization, it compiles to:
--   return x1 + x2, y1 + y2
-- The tables NEVER GET ALLOCATED
```
How does this work? LuaJIT tracks object allocations and their uses. If an object:
  1. Doesn't escape the trace (isn't returned or stored globally)
  2. Is only used for field access
  3. Has fields that can be computed at compile time
Then LuaJIT "sinks" the allocation—it never happens! The fields become virtual registers:
```
-- IR before allocation sinking:
0001  TABLE  t1
0002  STORE  t1.x  x1
0003  STORE  t1.y  y1
0004  TABLE  t2
0005  STORE  t2.x  x2
0006  STORE  t2.y  y2
0007  LOAD   r1  t1.x
0008  LOAD   r2  t2.x
0009  ADD    r3  r1  r2

-- IR after allocation sinking:
0001  ADD    r3  x1  x2  -- Tables disappeared!
```

Common Subexpression Elimination (CSE): Never Compute Twice

LuaJIT aggressively eliminates redundant computations:
```lua
function compute_distance(points, i, j)
  local dx = points[i].x - points[j].x
  local dy = points[i].y - points[j].y
  local distance = math.sqrt(dx*dx + dy*dy)

  -- These array lookups are identical to above
  local norm_x = (points[i].x - points[j].x) / distance
  local norm_y = (points[i].y - points[j].y) / distance

  return norm_x, norm_y
end

-- LuaJIT recognizes the repeated subexpressions:
-- points[i].x - points[j].x (computed once)
-- points[i].y - points[j].y (computed once)
```
The CSE pass builds a hash table of all computed expressions. When it sees a duplicate, it reuses the previous result:
```
-- Before CSE:
0001  TGET   r1  points  i     -- points[i]
0002  FLOAD  r2  r1.x          -- .x
0003  TGET   r3  points  j     -- points[j]
0004  FLOAD  r4  r3.x          -- .x
0005  SUB    r5  r2  r4        -- dx
...
0010  TGET   r6  points  i     -- points[i] AGAIN
0011  FLOAD  r7  r6.x          -- .x AGAIN
0012  TGET   r8  points  j     -- points[j] AGAIN
0013  FLOAD  r9  r8.x          -- .x AGAIN
0014  SUB    r10 r7  r9        -- Same subtraction!

-- After CSE:
0001  TGET   r1  points  i
0002  FLOAD  r2  r1.x
0003  TGET   r3  points  j
0004  FLOAD  r4  r3.x
0005  SUB    r5  r2  r4
...
0010  ; Instructions 10-14 eliminated, reuse r5!
```

Loop Invariant Code Motion (LICM): Don't Repeat in Loops

This optimization moves calculations that don't change out of loops:
```lua
function process_data(data, multiplier, offset)
  local result = 0
  local factor = multiplier * 2.5 + offset -- Loop invariant

  for i = 1, #data do
    local adjusted = data[i] * (multiplier * 2.5 + offset) -- Same calculation!
    result = result + adjusted
  end
  return result
end

-- LuaJIT hoists the invariant calculation:
-- multiplier * 2.5 + offset computed ONCE before the loop
```
LICM uses dominator tree analysis to find expressions that:
  1. Have operands that don't change in the loop
  2. Don't have side effects
  3. Are guaranteed to execute (no guard failures)
```
-- Before LICM:
LOOP:
  0001  MUL    r1  multiplier  2.5    -- Inside loop
  0002  ADD    r2  r1  offset         -- Inside loop
  0003  TGET   r3  data  i
  0004  MUL    r4  r3  r2
  0005  ADD    result  result  r4
  0006  ITERN  i  => LOOP

-- After LICM:
0001  MUL    r1  multiplier  2.5      -- Moved out!
0002  ADD    r2  r1  offset           -- Moved out!
LOOP:
  0003  TGET   r3  data  i
  0004  MUL    r4  r3  r2             -- Uses pre-computed r2
  0005  ADD    result  result  r4
  0006  ITERN  i  => LOOP
```

Alias Analysis: Proving Memory Independence

This is crucial for optimization. LuaJIT must prove that memory operations don't interfere:
```lua
function update_positions(objects, dt)
  for i = 1, #objects do
    objects[i].x = objects[i].x + objects[i].vx * dt
    objects[i].y = objects[i].y + objects[i].vy * dt
    -- Can we be sure objects[i] didn't change between accesses?
  end
end
```
LuaJIT's alias analysis tracks:
  • Type-based aliasing: Numbers can't alias with tables
  • Field-based aliasing: Different fields don't overlap
  • Index-based aliasing: Different array indices are independent
```
-- Alias analysis proves these don't interfere:
STORE  objects[i].x  new_x   -- Can't affect .y or .vx
LOAD   objects[i].y          -- Safe to load
LOAD   objects[i].vx         -- Safe to load
```

Guard Elimination: Removing Redundant Checks

This is where trace compilation really shines. Guards ensure we stay on the fast path:
```lua
function sum_positive(arr)
  local sum = 0
  for i = 1, #arr do
    if type(arr[i]) == "number" then -- Type guard
      if arr[i] > 0 then -- Value guard
        sum = sum + arr[i]
      end
    end
  end
  return sum
end
```
But checking the same thing repeatedly is wasteful. LuaJIT eliminates redundant guards:
```
-- First iteration:
0001  TGET   r1  arr  1
0002  ISNUM  r1            -- Guard: is it a number?
0003  ISPOS  r1            -- Guard: is it positive?
0004  ADD    sum  sum  r1

-- Second iteration (naive):
0005  TGET   r2  arr  2
0006  ISNUM  r2            -- Same type check again?
0007  ISPOS  r2            -- Another value check?

-- After guard elimination:
0005  TGET   r2  arr  2
0006  ; Type guard eliminated - array is homogeneous!
0007  ISPOS  r2            -- Value guard kept (values differ)
```

Advanced: NaN-Boxing and Type Specialization

LuaJIT stores all Lua values in 64-bit slots using NaN-boxing:
```c
// LuaJIT's value representation (simplified)
typedef union {
  double n;     // Numbers stored directly
  uint64_t u64; // For type tagging
} TValue;

// Special NaN patterns encode types:
//   0xfff8000000000000 | type | payload
// This allows:
//   - Numbers: stored as-is (fast path)
//   - Pointers: encoded in NaN space
//   - Booleans, nil: special NaN values
```
This enables type specialization:
```lua
-- Generic addition must handle all types:
function add(a, b)
  return a + b -- Could be numbers, strings, tables with __add
end

-- But in a trace where a and b are always numbers,
-- LuaJIT generates:
--   movsd xmm0, [rax] ; Load double directly
--   addsd xmm0, [rbx] ; Floating-point add
--   movsd [rcx], xmm0 ; Store double
-- No type checking needed!
```

The Optimization Pipeline

Here's how all these optimizations work together:

Recorded Trace → SSA Construction → Type Inference → Alias Analysis → CSE Pass → LICM Pass → Guard Elimination → Allocation Sinking → Dead Code Elimination (DCE) → Register Allocation → Machine Code Generation

Each optimization enables others:
  • Type inference enables guard elimination
  • Guard elimination enables CSE (same types guaranteed)
  • CSE enables allocation sinking (fewer uses to track)
  • LICM reduces register pressure for better allocation

Real-World Impact: Matrix Multiplication

Let's see all optimizations in action:
```lua
-- Naive matrix multiplication
function matmul(A, B, C, n)
  for i = 1, n do
    for j = 1, n do
      local sum = 0
      for k = 1, n do
        sum = sum + A[i][k] * B[k][j]
      end
      C[i][j] = sum
    end
  end
end

-- After LuaJIT optimization:
-- 1. Type guards eliminated (arrays proven homogeneous)
-- 2. Bounds checks eliminated (loop bounds proven safe)
-- 3. A[i] hoisted out of the k-loop (LICM)
-- 4. No temporary tables allocated
-- 5. Innermost loop compiled to a tight scalar SSE loop

-- Result: ~20x faster than interpreted Lua
-- Only ~2x slower than optimized C (gcc -O3)
```
The generated assembly is shockingly good:
```asm
; Inner loop after all optimizations:
.loop:
  movsd xmm0, [rsi+rax*8] ; A[i][k]
  movsd xmm1, [rdx+rcx*8] ; B[k][j]
  mulsd xmm0, xmm1        ; multiply
  addsd xmm2, xmm0        ; accumulate
  add   rax, 1            ; k++
  cmp   rax, r8           ; k < n?
  jl    .loop
```
That's literally what a C compiler would generate. No overhead. No type checks. No bounds checks. Just pure computation.

The Tragedy: Too Clever to Live

Around 2015, cracks began to show. Mike Pall was burning out. The mailing list posts became terser, then stopped. The commit frequency dropped. By 2017, development had essentially ceased.
The problem? LuaJIT is too sophisticated for its own good:

1. The Bus Factor of One

The codebase is ~100,000 lines of dense C and assembly, with architecture-specific backends for x86, x64, ARM, PPC, and MIPS. Comments are sparse. The design lives in Mike Pall's head.
```c
/* This is typical LuaJIT code - brilliant but impenetrable */
static Reg asm_fuseahuref(ASMState *as, IRRef ref, int32_t *ofsp, RegSet allow)
{
  IRIns *ir = IR(ref);
  if (ra_noreg(ir->r)) {
    if (ir->o == IR_AREF) {
      if (mayfuse(as, ref)) {
        if (irref_isk(ir->op2)) {
          IRRef tab = IR(ir->op1)->op1;
          int32_t ofs = 8*IR(ir->op2)->i;
          if (checki16(ofs)) {
            *ofsp = ofs;
            return ra_alloc1(as, tab, allow);
          }
        }
      }
    } else if (ir->o == IR_HREFK) {
      if (mayfuse(as, ref)) {
        int32_t ofs = (int32_t)(IR(ir->op2)->op2 * sizeof(Node));
        if (checki16(ofs)) {
          *ofsp = ofs;
          return ra_alloc1(as, ir->op1, allow);
        }
      }
    }
  }
  *ofsp = 0;
  return ra_alloc1(as, ref, allow);
}
```

2. The Maintenance Nightmare

Several groups have tried to maintain LuaJIT:
  • OpenResty: Maintains a fork with bug fixes but no major features
  • moonjit: Attempted to continue development but stalled
  • RaptorJIT: Stripped down to x64-only for maintainability
None have added significant new optimizations. The code is just too complex.

3. The Architecture Trap

Modern CPUs have changed since 2009:
  • Spectre/Meltdown make certain optimizations unsafe
  • Apple Silicon needs a new backend
  • WebAssembly offers new possibilities
But LuaJIT's architecture is so tightly optimized for 2009-era x64 that adapting it is nearly impossible.

Lessons from the Rise and Fall

What LuaJIT Taught Us

  1. Trace compilation works: Mozilla's TraceMonkey bet on the same approach in JavaScript, and LuaJIT showed how far it could go
  2. FFI can be zero-cost: Influenced Python's cffi and Julia's ccall
  3. One genius can beat a team: But only temporarily
  4. Performance matters: Users will adopt an obscure language if it's fast enough

What the Industry Learned (Or Didn't)

The tragedy of LuaJIT is that its techniques are proven to work, but:
  • V8 has 100+ engineers but took 8 years to catch up
  • Python is 50-100x slower but remains dominant
  • New JITs (GraalVM, etc.) ignore trace compilation
  • Corporate development rarely produces such innovations
⚠️
The real lesson? We're living in a world where the best compiler technology is abandoned because it's too good. LuaJIT proved that dynamic languages can be fast, but the industry chose slow and maintainable over fast and incomprehensible.

The Code That Could Have Changed Everything

Here's a final example that shows what we lost when LuaJIT development stopped:
```lua
-- This ray tracer runs at 60 FPS in LuaJIT
-- Try this in Python and watch your CPU melt

local ffi = require("ffi")
local C = ffi.C

ffi.cdef[[
  typedef struct { double x, y, z; } vec3;
  double sqrt(double);
  double pow(double, double);
]]

local vec3 = ffi.typeof("vec3")

local function dot(a, b)
  return a.x*b.x + a.y*b.y + a.z*b.z
end

local function normalize(v)
  local len = C.sqrt(dot(v, v))
  return vec3(v.x/len, v.y/len, v.z/len)
end

local function trace(orig, dir, spheres, depth)
  -- Ray tracing with full reflections
  -- This inner loop runs millions of times per second
  local nearest_t = 1e20
  local nearest_sphere = nil

  for i = 1, #spheres do
    local sphere = spheres[i]
    local oc = vec3(orig.x - sphere.x, orig.y - sphere.y, orig.z - sphere.z)
    local b = dot(oc, dir)
    local c = dot(oc, oc) - sphere.radius * sphere.radius
    local disc = b*b - c

    if disc > 0 then
      local t = -b - C.sqrt(disc)
      if t > 0.001 and t < nearest_t then
        nearest_t = t
        nearest_sphere = sphere
      end
    end
  end

  -- ... reflection calculation ...
  return color
end

-- This outperforms C++ ray tracers that use virtual functions
-- Because LuaJIT inlines EVERYTHING
```

The Bottom Line: A Cautionary Tale

LuaJIT represents both the pinnacle of compiler engineering and a cautionary tale about sustainable development. One brilliant developer created something that teams of engineers at Google, Microsoft, and Oracle struggled to match. But that same brilliance made it unmaintainable.
Today, LuaJIT still works. It's still fast. It still powers critical infrastructure. But it's frozen in time, a monument to what's possible when genius ignores conventional wisdom—and what happens when that genius walks away.
💡
The Takeaway: LuaJIT proved that dynamic languages don't have to be slow. It showed that one person with deep knowledge can outperform entire teams. But it also showed that sustainable software needs more than brilliance—it needs a community, documentation, and code that mortals can understand.
In the end, LuaJIT is like finding alien technology. We can use it, we can marvel at it, but we can barely comprehend it, much less improve it. It's a reminder that in software, being too far ahead of your time is indistinguishable from failure.
The industry chose mediocrity over brilliance. Python remains 100x slower. JavaScript engines use 100x more memory. But they have teams, documentation, and sustainable development.
Maybe that's the real lesson. Not that we should build like Mike Pall, but that we should build so others can build after us. Because in the end, the code that survives isn't the cleverest—it's the code that others can understand.
Still, one can't help but wonder: what if Mike Pall had kept going? What if LuaJIT had a team? We might be living in a world where dynamic languages are as fast as C, where scripting doesn't mean slow, where genius code could be both brilliant and sustainable.
But that's not the world we live in. Instead, we have LuaJIT: a masterpiece, frozen in amber, too perfect to improve, too complex to maintain, forever fast, forever alone.