Ever wondered how databases work under the hood? I decided to find out by building one from scratch. Meet Bazoola - a simple file-based database written in pure Python.
Why Build a Toy Database?
As a developer, I use relational databases every day, but I never truly understood what happens when I INSERT
or SELECT
. Building a database from scratch taught me more about data structures, file I/O, and system design than any tutorial ever could.
Plus, it's fun to implement something that seems like magic!
What is Bazoola?
Bazoola is a lightweight, educational database that stores data in human-readable text files. It's not meant to replace SQLite or PostgreSQL - it's meant to help understand how databases work.
Key Features:
Fixed-width column storage
CRUD operations - the basics every database needs
Foreign keys - because relationships matter
Automatic space management - reuses deleted record space
Human-readable files - you can literally
cat
your data
The Core Idea: Fixed-Width Records
Unlike CSV files where records have variable length, Bazoola uses fixed-width fields:
1 Alice 25
2 Bob 30
3 Charlie 35
This makes it trivial to:
Jump to any record by calculating its offset
Update records in-place without rewriting the file
Build simple indexes
Show Me the Code!
Here's how you use Bazoola:
from bazoola import *
# Define schemas
class TableAuthors(Table):
name = "authors"
schema = Schema([
Field("id", PK()),
Field("name", CHAR(20)),
])
class TableBooks(Table):
name = "books"
schema = Schema([
Field("id", PK()),
Field("title", CHAR(20)),
Field("author_id", FK("authors")),
Field("year", INT(null=True)),
])
# Create database instance and the files in `data/` subdir
db = DB([TableAuthors, TableBooks])
# Insert data
author = db.insert("authors", {"name": "John Doe"})
book = db.insert("books", {
"title": "My Book",
"author_id": author["id"],
"year": 2024
})
# Query with joins
book_with_author = db.find_by_id(
"books",
book["id"],
joins=[Join("author_id", "author", "authors")]
)
print(book_with_author)
# Output: {'id': 1, 'title': 'My Book', 'author_id': 1, 'year': 2024, 'author': {'id': 1, 'name': 'John Doe'}}
# Close the database
db.close()
Interesting Implementation Details
Space Management
When you delete a record, Bazoola fills it out with *
symbols, maintaining a stack of free positions:
class Table:
# ...
def delete_by_id(self, pk: int) -> None:
# ...
self.f.seek(rownum * self.row_size)
self.rownum_index.set(pk - 1, None)
self.f.write(b"*" * (self.row_size - 1) + b"\n")
self.free_rownums.push(rownum)
# ...
def seek_insert(self) -> None:
rownum = self.free_rownums.pop()
if rownum is not None:
self.f.seek(rownum * self.row_size)
return
self.f.seek(0, os.SEEK_END)
This simple approach prevents the database from growing indefinitely.
Indexing
Every table automatically gets an index on its primary key in <table_name>__id.idx.dat
file, so finding a record by ID is O(1) - just look up the position and seek!
Foreign Keys
Bazoola supports relationships between tables and joins:
# Query with joins
rows = db.find_all("books", joins=[
Join("author_id", "author", "authors")
])
# Inverse joins (one-to-many)
author = db.find_by_id("authors", 1, joins=[
InverseJoin("author_id", "books", "books")
])
However, it doesn't enforce referencial integrity yet, and you can basically delete anything you want ¯\_(ツ)_/¯
Human-Readable Storage
Data files are just formatted text:
$ cat data/books.dat
1 My Book 1 2024
************************************
3 My Book3 1 2024
Great for debugging and understanding what's happening!
What I Learned
Data format and file-based storages are tricky, that's why it's all just test-based.
Fixed-width has trade-offs - simple implementation, but wastes space. Real databases use more sophisticated formats.
Indexes are magic: the difference between O(n) table scan and O(1) index lookup is massive, even with small datasets.
Joins are hard.
Testing is crucial. Database bugs can corrupt data, and comprehensive tests saved me many times.
Limitations (and why they're OK)
The current implementation is intentionally simple:
no transactions (what if it fails mid-write?)
no concurrency (single-threaded only)
no query optimizer (every operation is naive)
no B-tree indices (just a simple index for primary keys)
no SQL (just Python API)
fixed-width wastes space
These aren't bugs - they're learning opportunities! Each limitation is a rabbit hole to explore.
Try It Yourself!
pip install bazoola
The code is on GitHub. It's ~700 lines of Python - small enough to read in one sitting!
What's Next?
Building Bazoola opened my eyes to database internals. Some ideas for further versions:
TEXT
fields with a separate storage,strict foreign key constraints,
unique constraints,
B-tree indexes for range queries,
ALTER TABLE
,simple query planner,
pure C implementation with Python bindings,
binary storage,
and so on and so forth.
But honestly, even this simple version taught me a ton.
Conclusion
Building a toy database is one of the best learning projects I've done. It's complex enough to be challenging but simple enough to finish. You'll understand your "real" database better, and you might even have fun!