2023-07-15 · 22 min read
Performance Engineering

LuaJIT: The One-Man Compiler That Embarrassed the Industry


Imagine a scripting language that's 50-100x faster than Python, often within 2x of C performance, with a foreign function interface so elegant it makes ctypes look like cave paintings. Now imagine this was all created by one person, working alone, who then disappeared—leaving behind a codebase so sophisticated that some of the world's best compiler engineers struggle to understand it.
This is the story of LuaJIT and its creator Mike Pall, a tale that exposes uncomfortable truths about software engineering, the limits of corporate development, and what happens when genius-level work becomes too clever for its own good.
LuaJIT isn't just fast—it pioneered compiler techniques that wouldn't appear in mainstream JITs for years. Its trace compiler was doing optimizations in 2009 that V8 only started implementing in 2017. Yet today, it's slowly dying because no one can maintain it.

The Shocking Numbers That Started Everything

Let me show you why LuaJIT matters with some benchmarks that still make people do a double-take:
```lua
-- Naive Fibonacci in different languages
function fib(n)
  if n < 2 then return n end
  return fib(n-1) + fib(n-2)
end

-- Benchmark results for fib(40):
-- CPython 3.9:    ~32 seconds
-- Ruby 3.0:       ~19 seconds
-- PHP 8.0:        ~11 seconds
-- JavaScript V8:  ~1.2 seconds
-- Lua 5.4:        ~15 seconds
-- LuaJIT 2.1:     ~0.8 seconds
-- C (gcc -O2):    ~0.5 seconds
```
LuaJIT is beating V8—Google's JavaScript engine with a massive team and billions in funding—with a JIT compiler written by one person. But raw recursion isn't even where LuaJIT shines brightest:
```lua
-- Numeric computation benchmark
local function mandelbrot(N)
  local width, height, limit2 = N, N, 4.0
  local iter = 50
  local bits, bit = 0, 128

  for y = 0, height - 1 do
    for x = 0, width - 1 do
      local Zr, Zi, Cr, Ci = 0.0, 0.0, 0.0, 0.0
      Cr = 2.0 * x / width - 1.5
      Ci = 2.0 * y / height - 1.0

      local i = iter
      repeat
        local Tr = Zr * Zr - Zi * Zi + Cr
        local Ti = 2.0 * Zr * Zi + Ci
        Zr, Zi = Tr, Ti
        i = i - 1
      until (Zr * Zr + Zi * Zi > limit2) or (i == 0)

      if i == 0 then bits = bits + bit end
      if bit == 1 then
        io.write(string.char(bits))
        bits, bit = 0, 128
      else
        bit = bit / 2
      end
    end
  end
end

-- Performance (smaller is better):
-- C (gcc -O3):    0.146s
-- LuaJIT:         0.172s (only 18% slower!)
-- Java:           0.291s
-- JavaScript V8:  0.385s
-- Lua 5.4:        8.937s (52x slower than LuaJIT!)
```
⚠️
These aren't cherry-picked benchmarks. LuaJIT consistently delivers near-C performance for numeric code, often beating Java and leaving other dynamic languages far behind. On the Computer Language Benchmarks Game, LuaJIT routinely topped the scripting-language results by an embarrassing margin.

The Man Behind the Magic: Mike Pall

Mike Pall is the John Carmack of compiler engineering—a programmer so far ahead of the curve that his work seems like magic. But unlike Carmack, who became a celebrity, Pall remained obscure, communicating mainly through mailing list posts that read like doctoral dissertations.
His background was in assembly language and low-level optimization. Before LuaJIT, he'd contributed to various open source projects, always focused on performance. When he discovered Lua in 2005, he saw an opportunity: Lua was small, clean, and had clear semantics—perfect for a JIT compiler.
What happened next was unprecedented. Working entirely alone, without corporate backing, Pall created:
  1. LuaJIT 1.x (2005-2012): A straightforward JIT that was already faster than most alternatives
  2. LuaJIT 2.0 (2009-2017): A complete rewrite using trace compilation, achieving near-C performance
  3. The FFI (2011): A foreign function interface so elegant it redefined what was possible
Then, around 2017, he largely disappeared. The git commits slowed, then stopped. The mailing list went quiet. LuaJIT development essentially froze.

Trace Compilation: The Secret Weapon

To understand why LuaJIT is so fast, you need to understand trace compilation—a technique so powerful that it seems like cheating.
Traditional JIT compilers work on methods:

Source Code → Parse to Bytecode → Identify Hot Methods → Compile Entire Method → Native Code

But trace compilers work on execution paths:

Interpreter Running → Hot Loop Detected? — if No, keep interpreting; if Yes: Start Recording → Record Actual Execution Path → Compile Just That Path → Super-Optimized Native Code

Here's why this is genius:
```lua
-- Consider this code:
function process_data(items)
  local sum = 0
  for i = 1, #items do
    if items[i] > 0 then -- Branch
      if items[i] < 100 then -- Another branch
        sum = sum + items[i]
      else
        sum = sum + 100
      end
    end
  end
  return sum
end

-- Method JIT must compile all possible paths
-- Trace JIT only compiles the path actually taken!
```
When LuaJIT detects a hot loop, it doesn't compile the function—it records exactly what happens:
```
-- If your data is usually positive numbers under 100,
-- LuaJIT records THIS trace:
TRACE 1 start process_data:2
0001  GGET     2   0      ; "items"
0002  LEN      3   2
0003  LOOP     4 => 0013
0004  GGET     5   0      ; "items"
0005  TGET     6   5   4  ; items[i]
0006  KSHORT   7   0      ; 0
0007  ISGT     6   7      ; Guard: items[i] > 0
0008  KSHORT   8   100    ; 100
0009  ISLT     6   8      ; Guard: items[i] < 100
0010  ADD      1   1   6  ; sum = sum + items[i]
0011  ADDVN    4   4   0  ; i = i + 1
0012  JMP      4 => 0003
0013  TRACE 1 stop -> loop
```
The Guards are the magic—they check if we're still on the recorded path. If a guard fails, we fall back to the interpreter and maybe record a new trace.

The FFI: When Zero Cost Actually Means Zero

LuaJIT's Foreign Function Interface is perhaps its most revolutionary feature. While Python's ctypes makes you want to cry, LuaJIT's FFI is so simple it seems like it shouldn't work:
```lua
-- This is ALL you need to call C functions:
local ffi = require("ffi")
ffi.cdef[[
  typedef struct { double x, y; } point_t;
  double sqrt(double x);
  double atan2(double y, double x);
]]

-- Now just use it!
local function distance(p1, p2)
  local dx = p2.x - p1.x
  local dy = p2.y - p1.y
  return ffi.C.sqrt(dx*dx + dy*dy) -- Calls C's sqrt directly!
end

-- Creating C structs is trivial
local p1 = ffi.new("point_t", {x = 10, y = 20})
local p2 = ffi.new("point_t", {x = 30, y = 40})
print(distance(p1, p2)) -- 28.284271247462
```
But here's the kicker—this has ZERO overhead. The JIT compiler inlines everything:
```lua
-- This Lua code:
local function add_arrays(a, b, c, n)
  for i = 0, n-1 do
    c[i] = a[i] + b[i]
  end
end

-- Compiles to THIS machine code:
--   movsd xmm0, [rsi+rax*8] ; Load a[i]
--   addsd xmm0, [rdx+rax*8] ; Add b[i]
--   movsd [rcx+rax*8], xmm0 ; Store to c[i]
--   inc   rax               ; i++
--   cmp   rax, rdi          ; i < n?
--   jl    loop              ; Loop if true

-- That's literally what a C compiler would generate!
```

Real-World Domination

LuaJIT's performance made it the choice for performance-critical applications everywhere:

1. OpenResty: Nginx on Steroids

OpenResty embeds LuaJIT into Nginx, creating a web server that can handle application logic at wire speed:
```lua
-- This handles 1M+ requests/second:
local redis = require "resty.redis"
local cjson = require "cjson"

local red = redis:new()
local ok, err = red:connect("127.0.0.1", 6379)
if not ok then
  ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
end

local user_id = ngx.var.arg_id
local user_data = red:get("user:" .. user_id)

-- lua-resty-redis returns ngx.null for a missing key
if user_data and user_data ~= ngx.null then
  ngx.header.content_type = "application/json"
  ngx.say(user_data)
else
  ngx.status = 404
  ngx.say(cjson.encode({error = "User not found"}))
end
```
Cloudflare, Alibaba, and Tumblr all run OpenResty at massive scale. Cloudflare handles 25 million HTTP requests per second using LuaJIT for edge computing.

2. Game Development: From Angry Birds to World of Warcraft

Every World of Warcraft addon runs on Lua, but games using LuaJIT saw 10-50x performance improvements:
```lua
-- LÖVE2D game engine with LuaJIT
function love.update(dt)
  -- This particle system can handle 100K+ particles at 60 FPS
  -- (iterate backwards so table.remove doesn't skip elements)
  for i = #particles, 1, -1 do
    local p = particles[i]
    p.vy = p.vy + gravity * dt
    p.x = p.x + p.vx * dt
    p.y = p.y + p.vy * dt
    p.life = p.life - dt

    if p.life <= 0 then
      table.remove(particles, i)
    end
  end
end
```

3. Scientific Computing: When Python is Too Slow

Scientists discovered LuaJIT could replace Python+NumPy for many tasks:
```lua
-- SciLua: Scientific computing in LuaJIT
local ffi = require("ffi")
local C = ffi.C

ffi.cdef[[
  void dgemm_(const char* transa, const char* transb, int* m, int* n, int* k,
              double* alpha, double* a, int* lda, double* b, int* ldb,
              double* beta, double* c, int* ldc);
]]

-- Direct BLAS calls with zero overhead!
-- (Fortran BLAS takes every argument by pointer, so scalars are boxed)
local int = ffi.typeof("int[1]")
local function matrix_multiply(A, B, Cm, m, n, k)
  C.dgemm_("N", "N", int(m), int(n), int(k),
           ffi.new("double[1]", 1.0), A, int(m), B, int(k),
           ffi.new("double[1]", 0.0), Cm, int(m))
end
```

The Architecture: How Mike Pall Did It

LuaJIT's architecture is a masterclass in compiler design. Here's the high-level view:

  • Frontend: Lua Source → Bytecode Compiler → Optimized Bytecode
  • Interpreter: the fast interpreter executes the bytecode; when a loop gets hot, control passes to the trace compiler (otherwise it just keeps interpreting)
  • Trace Compiler: Trace Recorder → SSA IR Builder → Optimization Passes → Register Allocator → Machine Code Gen
  • Runtime: Code Cache → Native Execution, which stays in machine code while every guard holds and drops back to the interpreter when one fails
1. The Bytecode: Designed for Speed

LuaJIT uses its own compact register-based bytecode (redesigned from scratch rather than reusing PUC Lua's format), with each instruction carefully designed to map efficiently to machine code:
```
-- Lua source:
local x = a + b * c

-- LuaJIT bytecode:
GGET  0  "a"      ; Load global 'a' into register 0
GGET  1  "b"      ; Load global 'b' into register 1
GGET  2  "c"      ; Load global 'c' into register 2
MUL   1  1  2     ; r1 = r1 * r2
ADD   0  0  1     ; r0 = r0 + r1
```

2. The Trace Recorder: Watching Your Code Run

When the interpreter detects a hot loop, it switches to recording mode:
```c
// Simplified trace recording logic
void record_trace() {
  while (recording) {
    BCIns ins = *pc++; // Get next bytecode

    switch (bc_op(ins)) {
    case BC_ADD:
      emit_ir(IR_ADD, bc_a(ins), bc_b(ins), bc_c(ins));
      break;
    case BC_LOOP:
      if (++loop_count > HOTLOOP_THRESHOLD) {
        end_trace();
        compile_trace();
      }
      break;
    // ... guards for type checks, bounds checks, etc.
    }
  }
}
```

3. The IR: Static Single Assignment Form

LuaJIT converts bytecode to an SSA-based intermediate representation:
```
-- For this Lua code:
local sum = 0
for i = 1, n do
    sum = sum + arr[i]
end

-- LuaJIT generates this IR:
0001 >  int SLOAD  #2    CI  ; n
0002 >  int LE     0001  +2147483646
0003    int SLOAD  #1    CI  ; i
0004    p32 AREF   P[0x400] 0003
0005 >  num ALOAD  0004
0006    num SLOAD  #3        ; sum
0007 +  num ADD    0006  0005
0008 +  int ADD    0003  +1
0009 >  int LE     0008  0001
0010 ------ LOOP ------------
0011 >  int PHI    0003  0008  ; Loop variable
0012 >  num PHI    0006  0007  ; Accumulator
```

4. The Optimizations: Where the Magic Happens

LuaJIT applies sophisticated optimizations that rival commercial compilers. Let's dive deep into each one to understand why they're so powerful:

Allocation Sinking: Objects That Never Exist

This is perhaps LuaJIT's most impressive trick. Consider this code:
```lua
-- Normal Lua creates temporary tables here
function vector_add(x1, y1, x2, y2)
  local v1 = {x = x1, y = y1} -- Allocation?
  local v2 = {x = x2, y = y2} -- Another allocation?
  return v1.x + v2.x, v1.y + v2.y
end

-- LuaJIT's trace compiler sees through this!
-- After optimization, it compiles to:
--   return x1 + x2, y1 + y2
-- The tables NEVER GET ALLOCATED
```
How does this work? LuaJIT tracks object allocations and their uses. If an object:
  1. Doesn't escape the trace (isn't returned or stored globally)
  2. Is only used for field access
  3. Has fields that can be computed at compile time
Then LuaJIT "sinks" the allocation—it never happens! The fields become virtual registers:
```
-- IR before allocation sinking:
0001  TABLE  t1
0002  STORE  t1.x  x1
0003  STORE  t1.y  y1
0004  TABLE  t2
0005  STORE  t2.x  x2
0006  STORE  t2.y  y2
0007  LOAD   r1  t1.x
0008  LOAD   r2  t2.x
0009  ADD    r3  r1  r2

-- IR after allocation sinking:
0001  ADD    r3  x1  x2  -- Tables disappeared!
```

Common Subexpression Elimination (CSE): Never Compute Twice

LuaJIT aggressively eliminates redundant computations:
```lua
function compute_distance(points, i, j)
  local dx = points[i].x - points[j].x
  local dy = points[i].y - points[j].y
  local distance = math.sqrt(dx*dx + dy*dy)

  -- These array lookups are identical to above
  local norm_x = (points[i].x - points[j].x) / distance
  local norm_y = (points[i].y - points[j].y) / distance

  return norm_x, norm_y
end

-- LuaJIT recognizes the repeated subexpressions:
-- points[i].x - points[j].x (computed once)
-- points[i].y - points[j].y (computed once)
```
The CSE pass builds a hash table of all computed expressions. When it sees a duplicate, it reuses the previous result:
```
-- Before CSE:
0001  TGET   r1  points  i     -- points[i]
0002  FLOAD  r2  r1.x          -- .x
0003  TGET   r3  points  j     -- points[j]
0004  FLOAD  r4  r3.x          -- .x
0005  SUB    r5  r2  r4        -- dx
...
0010  TGET   r6  points  i     -- points[i] AGAIN
0011  FLOAD  r7  r6.x          -- .x AGAIN
0012  TGET   r8  points  j     -- points[j] AGAIN
0013  FLOAD  r9  r8.x          -- .x AGAIN
0014  SUB    r10 r7  r9        -- Same subtraction!

-- After CSE:
0001  TGET   r1  points  i
0002  FLOAD  r2  r1.x
0003  TGET   r3  points  j
0004  FLOAD  r4  r3.x
0005  SUB    r5  r2  r4
...
0010  ; Instructions 10-14 eliminated, reuse r5!
```

Loop Invariant Code Motion (LICM): Don't Repeat in Loops

This optimization moves calculations that don't change out of loops:
```lua
function process_data(data, multiplier, offset)
  local result = 0
  local factor = multiplier * 2.5 + offset -- Loop invariant

  for i = 1, #data do
    local adjusted = data[i] * (multiplier * 2.5 + offset) -- Same calculation!
    result = result + adjusted
  end
  return result
end

-- LuaJIT hoists the invariant calculation:
-- multiplier * 2.5 + offset computed ONCE before the loop
```
LICM uses dominator tree analysis to find expressions that:
  1. Have operands that don't change in the loop
  2. Don't have side effects
  3. Are guaranteed to execute (no guard failures)
```
-- Before LICM:
LOOP:
  0001  MUL    r1  multiplier  2.5    -- Inside loop
  0002  ADD    r2  r1  offset         -- Inside loop
  0003  TGET   r3  data  i
  0004  MUL    r4  r3  r2
  0005  ADD    result  result  r4
  0006  ITERN  i  => LOOP

-- After LICM:
0001  MUL    r1  multiplier  2.5      -- Moved out!
0002  ADD    r2  r1  offset           -- Moved out!
LOOP:
  0003  TGET   r3  data  i
  0004  MUL    r4  r3  r2             -- Uses pre-computed r2
  0005  ADD    result  result  r4
  0006  ITERN  i  => LOOP
```

Alias Analysis: Proving Memory Independence

This is crucial for optimization. LuaJIT must prove that memory operations don't interfere:
```lua
function update_positions(objects, dt)
  for i = 1, #objects do
    objects[i].x = objects[i].x + objects[i].vx * dt
    objects[i].y = objects[i].y + objects[i].vy * dt
    -- Can we be sure objects[i] didn't change between accesses?
  end
end
```
LuaJIT's alias analysis tracks:
  • Type-based aliasing: Numbers can't alias with tables
  • Field-based aliasing: Different fields don't overlap
  • Index-based aliasing: Different array indices are independent
```
-- Alias analysis proves these don't interfere:
STORE  objects[i].x  new_x   -- Can't affect .y or .vx
LOAD   objects[i].y          -- Safe to load
LOAD   objects[i].vx         -- Safe to load
```

Guard Elimination: Removing Redundant Checks

This is where trace compilation really shines. Guards ensure we stay on the fast path:
```lua
function sum_positive(arr)
  local sum = 0
  for i = 1, #arr do
    if type(arr[i]) == "number" then -- Type guard
      if arr[i] > 0 then -- Value guard
        sum = sum + arr[i]
      end
    end
  end
  return sum
end
```
But checking the same thing repeatedly is wasteful. LuaJIT eliminates redundant guards:
```
-- First iteration:
0001  TGET   r1  arr  1
0002  ISNUM  r1            -- Guard: is it a number?
0003  ISPOS  r1            -- Guard: is it positive?
0004  ADD    sum  sum  r1

-- Second iteration (naive):
0005  TGET   r2  arr  2
0006  ISNUM  r2            -- Same type check again?
0007  ISPOS  r2            -- Another value check?

-- After guard elimination:
0005  TGET   r2  arr  2
0006  ; Type guard eliminated - array is homogeneous!
0007  ISPOS  r2            -- Value guard kept (values differ)
```

Advanced: NaN-Boxing and Type Specialization

LuaJIT stores all Lua values in 64-bit slots using NaN-boxing:
```c
// LuaJIT's value representation (simplified)
typedef union {
  double n;     // Numbers stored directly
  uint64_t u64; // For type tagging
} TValue;

// Special NaN patterns encode types:
//   0xfff8000000000000 | type | payload
// This allows:
//   - Numbers: stored as-is (fast path)
//   - Pointers: encoded in NaN space
//   - Booleans, nil: special NaN values
```
This enables type specialization:
```lua
-- Generic addition must handle all types:
function add(a, b)
  return a + b -- Could be numbers, strings, tables with __add
end

-- But in a trace where a and b are always numbers,
-- LuaJIT generates:
--   movsd xmm0, [rax] ; Load double directly
--   addsd xmm0, [rbx] ; Floating-point add
--   movsd [rcx], xmm0 ; Store double
-- No type checking needed!
```

The Optimization Pipeline

Here's how all these optimizations work together:

Recorded Trace → SSA Construction → Type Inference → Alias Analysis → CSE Pass → LICM Pass → Guard Elimination → Allocation Sinking → Dead Code Elimination (DCE) → Register Allocation → Machine Code Generation

Each optimization enables others:
  • Type inference enables guard elimination
  • Guard elimination enables CSE (same types guaranteed)
  • CSE enables allocation sinking (fewer uses to track)
  • LICM reduces register pressure for better allocation

Real-World Impact: Matrix Multiplication

Let's see all optimizations in action:
```lua
-- Naive matrix multiplication
function matmul(A, B, C, n)
  for i = 1, n do
    for j = 1, n do
      local sum = 0
      for k = 1, n do
        sum = sum + A[i][k] * B[k][j]
      end
      C[i][j] = sum
    end
  end
end

-- After LuaJIT optimization:
-- 1. Type guards eliminated (arrays proven homogeneous)
-- 2. Bounds checks eliminated (loop bounds proven safe)
-- 3. A[i] hoisted out of the k-loop (LICM)
-- 4. No temporary tables allocated
-- 5. Innermost loop compiled to a tight scalar SSE loop

-- Result: ~20x faster than interpreted Lua
-- Only ~2x slower than optimized C (gcc -O3)
```
The generated assembly is shockingly good:
```asm
; Inner loop after all optimizations:
.loop:
  movsd xmm0, [rsi+rax*8] ; A[i][k]
  movsd xmm1, [rdx+rcx*8] ; B[k][j]
  mulsd xmm0, xmm1        ; multiply
  addsd xmm2, xmm0        ; accumulate
  add   rax, 1            ; k++
  cmp   rax, r8           ; k < n?
  jl    .loop
```
That's literally what a C compiler would generate. No overhead. No type checks. No bounds checks. Just pure computation.

The Tragedy: Too Clever to Live

Around 2015, cracks began to show. Mike Pall was burning out. The mailing list posts became terser, then stopped. The commit frequency dropped. By 2017, development had essentially ceased.
The problem? LuaJIT is too sophisticated for its own good:

1. The Bus Factor of One

The codebase is ~100,000 lines of dense C and assembly, with architecture-specific backends for x86, x64, ARM, PPC, and MIPS. Comments are sparse. The design lives in Mike Pall's head.
```c
/* This is typical LuaJIT code - brilliant but impenetrable */
static Reg asm_fuseahuref(ASMState *as, IRRef ref, int32_t *ofsp, RegSet allow)
{
  IRIns *ir = IR(ref);
  if (ra_noreg(ir->r)) {
    if (ir->o == IR_AREF) {
      if (mayfuse(as, ref)) {
        if (irref_isk(ir->op2)) {
          IRRef tab = IR(ir->op1)->op1;
          int32_t ofs = 8*IR(ir->op2)->i;
          if (checki16(ofs)) {
            *ofsp = ofs;
            return ra_alloc1(as, tab, allow);
          }
        }
      }
    } else if (ir->o == IR_HREFK) {
      if (mayfuse(as, ref)) {
        int32_t ofs = (int32_t)(IR(ir->op2)->op2 * sizeof(Node));
        if (checki16(ofs)) {
          *ofsp = ofs;
          return ra_alloc1(as, ir->op1, allow);
        }
      }
    }
  }
  *ofsp = 0;
  return ra_alloc1(as, ref, allow);
}
```

2. The Maintenance Nightmare

Several groups have tried to maintain LuaJIT:
  • OpenResty: Maintains a fork with bug fixes but no major features
  • moonjit: Attempted to continue development but stalled
  • RaptorJIT: Stripped down to x64-only for maintainability
None have added significant new optimizations. The code is just too complex.

3. The Architecture Trap

Modern CPUs have changed since 2009:
  • Spectre/Meltdown make certain optimizations unsafe
  • Apple Silicon needs a new backend
  • WebAssembly offers new possibilities
But LuaJIT's architecture is so tightly optimized for 2009-era x64 that adapting it is nearly impossible.

Lessons from the Rise and Fall

What LuaJIT Taught Us

  1. Trace compilation works: Mozilla's TraceMonkey bet on the same approach in JavaScript, and LuaJIT showed how far it could go
  2. FFI can be zero-cost: Influenced Python's cffi and Julia's ccall
  3. One genius can beat a team: But only temporarily
  4. Performance matters: Users will adopt an obscure language if it's fast enough

What the Industry Learned (Or Didn't)

The tragedy of LuaJIT is that its techniques are proven to work, but:
  • V8 has 100+ engineers but took 8 years to catch up
  • Python is 50-100x slower but remains dominant
  • New JITs (GraalVM, etc.) ignore trace compilation
  • Corporate development rarely produces such innovations
⚠️
The real lesson? We're living in a world where the best compiler technology is abandoned because it's too good. LuaJIT proved that dynamic languages can be fast, but the industry chose slow and maintainable over fast and incomprehensible.

The Code That Could Have Changed Everything

Here's a final example that shows what we lost when LuaJIT development stopped:
```lua
-- This ray tracer runs at 60 FPS in LuaJIT
-- Try this in Python and watch your CPU melt

local ffi = require("ffi")
local C = ffi.C

ffi.cdef[[
  typedef struct { double x, y, z; } vec3;
  double sqrt(double);
  double pow(double, double);
]]

local vec3 = ffi.typeof("vec3")

local function dot(a, b)
  return a.x*b.x + a.y*b.y + a.z*b.z
end

local function normalize(v)
  local len = C.sqrt(dot(v, v))
  return vec3(v.x/len, v.y/len, v.z/len)
end

local function trace(orig, dir, spheres, depth)
  -- Ray tracing with full reflections
  -- This inner loop runs millions of times per second
  local nearest_t = 1e20
  local nearest_sphere = nil

  for i = 1, #spheres do
    local sphere = spheres[i]
    local oc = vec3(orig.x - sphere.x, orig.y - sphere.y, orig.z - sphere.z)
    local b = dot(oc, dir)
    local c = dot(oc, oc) - sphere.radius * sphere.radius
    local disc = b*b - c

    if disc > 0 then
      local t = -b - C.sqrt(disc)
      if t > 0.001 and t < nearest_t then
        nearest_t = t
        nearest_sphere = sphere
      end
    end
  end

  -- ... reflection calculation ...
  return color
end

-- This outperforms C++ ray tracers that use virtual functions
-- Because LuaJIT inlines EVERYTHING
```

The Bottom Line: A Cautionary Tale

LuaJIT represents both the pinnacle of compiler engineering and a cautionary tale about sustainable development. One brilliant developer created something that teams of engineers at Google, Microsoft, and Oracle struggled to match. But that same brilliance made it unmaintainable.
Today, LuaJIT still works. It's still fast. It still powers critical infrastructure. But it's frozen in time, a monument to what's possible when genius ignores conventional wisdom—and what happens when that genius walks away.
💡
The Takeaway: LuaJIT proved that dynamic languages don't have to be slow. It showed that one person with deep knowledge can outperform entire teams. But it also showed that sustainable software needs more than brilliance—it needs a community, documentation, and code that mortals can understand.
In the end, LuaJIT is like finding alien technology. We can use it, we can marvel at it, but we can barely comprehend it, much less improve it. It's a reminder that in software, being too far ahead of your time is indistinguishable from failure.
The industry chose mediocrity over brilliance. Python remains 100x slower. JavaScript engines use 100x more memory. But they have teams, documentation, and sustainable development.
Maybe that's the real lesson. Not that we should build like Mike Pall, but that we should build so others can build after us. Because in the end, the code that survives isn't the cleverest—it's the code that others can understand.
Still, one can't help but wonder: what if Mike Pall had kept going? What if LuaJIT had a team? We might be living in a world where dynamic languages are as fast as C, where scripting doesn't mean slow, where genius code could be both brilliant and sustainable.
But that's not the world we live in. Instead, we have LuaJIT: a masterpiece, frozen in amber, too perfect to improve, too complex to maintain, forever fast, forever alone.