Think before you CREATE TABLE: A relational algebra approach

After building a fair number of applications for personal projects, there came a point when I started wondering: what if one of these actually took off? Would my database hold up? Would I be scrambling to fix schema issues while angry users are tweeting about downtime?

Sure, those apps work. The queries run fine with 100 users. But what about 10,000? 100,000? I have no way to know if I’m making horrible mistakes that I just haven’t encountered yet.

Personal projects are pretty low stakes: if the database gets weird, I can just wipe it clean and start over. But I can’t do that with production applications. What if I join a company with millions of users and I’m the one who has to design a critical table? Messing up without even knowing is the worst thing that could happen.

The problem is: I don’t know what I don’t know. And SQL doesn’t tell you if you’re making a mistake when you’re designing something. It lets you do it.

What this post is really about

Apologies for taking so long to get to the point. This isn’t me pretending to be some sort of database design expert—this is just me going back to the fundamentals in an attempt to figure out a way to at least make sure I’m not making any critical mistakes.

What I hope to demonstrate is:

  • How to think about schema design using relational algebra concepts
  • Real examples of schemas that I’ve dealt with (I didn’t create them… well not ALL of them)
  • A methodology I’ll be developing to catch these issues early on

Common Schema Mistakes (And How Relational Algebra Exposes Them)

Example 1: The JSONB Flexibility Trap

CREATE TABLE products (
    id INT PRIMARY KEY,
    name VARCHAR(255),
    attributes JSONB  -- {"category": "boots", "color": "red", "size": "10"}
);

Initial reasoning: “Products have different attributes depending on their type. Boots have sizes, books have page counts. JSON gives us flexibility without needing a column for every possible attribute!”

It seemed perfect. No messy migrations when we add new product types.

The Problems

Problem 1: Simple queries became complicated

  • “Find all red boots in size 10”
  • Instead of a clean WHERE clause, I need JSON extraction functions (see the sketch below)
  • Every query requires parsing JSON at runtime
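
Here’s roughly what that looks like in practice (a sketch, using Postgres’s ->> extraction operator):

-- Querying into JSONB: every predicate is an extraction expression
SELECT * FROM products
WHERE attributes->>'category' = 'boots'
  AND attributes->>'color' = 'red'
  AND attributes->>'size' = '10';

-- What we'd rather be writing, if these were real columns:
SELECT * FROM products
WHERE category = 'boots' AND color = 'red' AND size = '10';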

Problem 2: No way to enforce data quality

  • Nothing stops someone from entering {"color": "rde"} (typo)
  • Nothing requires boots to have a size (there’s no NOT NULL for a JSON key)
  • Nothing validates that size is actually a number
  • The database can’t help me — JSON is just text with structure
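
For contrast, a sketch of the same rules as real columns (the colors and sizes reference tables are assumed here; this is essentially where we end up in Step 6):

-- Quality rules become declarative once attributes are real columns:
CREATE TABLE boots (
    product_id INT PRIMARY KEY REFERENCES products(id),
    color_id INT NOT NULL REFERENCES colors(id),  -- typos like 'rde' become impossible
    size_id INT NOT NULL REFERENCES sizes(id)     -- every boot must have a valid size
);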

Problem 3: Performance degraded

  • Filtering on attributes will do full table scans
  • Indexes won’t work (or require expression indexes that precompute values)
  • As the table grows, these queries will gradually become slower

What Relational Algebra Reveals

Query: Find red boots in size 10

σattributes->>‘category’=‘boots’ ∧ attributes->>‘color’=‘red’ ∧ attributes->>‘size’=‘10’(Products)

Problems:

1. attributes->>'color' is NOT a proper selection predicate

  • It’s a function call, not an attribute reference
  • The database can’t use a standard B-tree index on this. Sure, I can create an expression index: CREATE INDEX ON products ((attributes->>'color')). But now the database has to precompute and store these values in the index. So I’m storing the data twice anyway - once in JSON, once in the index!

2. There’s no way to express πcolor(Products)

  • color is not an attribute in relational algebra terms—it’s buried inside JSONB
  • I have to use: πattributes->>‘color’(Products). This returns NULL for products without a color key
  • And I can’t tell the difference between “no color” and “color key missing”

3. Constraint: ∀p ∈ σcategory=‘boots’: p.size ≠ NULL

  • Translation: “All boots must have a size”
  • Not practical to express this at the database level
  • A CHECK constraint can technically reach into JSONB (see the sketch below), but it’s brittle and unwieldy
  • No foreign key can validate JSONB values against a reference table
  • Nothing stops me from inserting a boot with no size, or size: “extra large” (instead of something pre-defined like “UK07”)
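
For what it’s worth, Postgres will accept a CHECK over JSONB expressions; a sketch of how clumsy it gets (and it still can’t validate sizes against a reference table):

-- Technically possible, but brittle: one hand-written rule per category.
-- Rows without a category key pass vacuously (CHECK treats NULL as pass).
ALTER TABLE products ADD CONSTRAINT boots_need_size
    CHECK (
        attributes->>'category' <> 'boots'
        OR attributes->>'size' IS NOT NULL
    );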

What relational algebra tells us:

The expression we actually want to write is:

πp.name( σc.name=‘red’ ∧ s.name=‘10’( Products p ⋈ Boots b ⋈ Colors c ⋈ Sizes s ) )

Example 2: The Denormalized Audit Trail

CREATE TABLE user_audit (
    audit_id INT PRIMARY KEY,
    user_id INT,
    email VARCHAR(100),
    department_name VARCHAR(100),  -- Copied from departments table
    manager_name VARCHAR(100),     -- Copied from users table
    changed_at TIMESTAMP,
    changed_by INT
);

The reasoning seems solid: “We need to preserve exactly what things looked like at the moment of the change. If a department gets renamed later, we want the audit log to show the old name.”

The Problems

Problem 1: The audit trail becomes misleading

(The deeper issue here is that we’re trying to preserve point-in-time state by copying data rather than maintaining temporal identity through department_id and a history table. But I’ll have to research temporal modeling properly, so let’s skip that for now.)

  • Let’s say the Engineering department was renamed to “Engineering & AI”. Now all old audit entries show “Engineering”
  • Users looking at the audit log think: “Why does Bob’s old record show a different department?” They don’t realize it’s the same department, just renamed

Problem 2: The “immutable” audit log needed updates

  • When departments merged, we had to update thousands of audit rows, which defeats the purpose of an “immutable” audit trail

Problem 3: Simple queries became impossible

  • “Show me everyone who’s ever been in the Engineering department”
  • But wait, is that “Engineering”, “Engineering & AI”, or both?
  • String matching on department names is a nightmare

What Relational Algebra Reveals

Query: Show audit trail with current department names

πemail,department_name,manager_name,changed_at(user_audit)

Problems:

1. department_name is STALE

  • Department “Engineering” was renamed to “Engineering & AI”
  • Audit shows old name, but users expect current name
  • To get current names, we need to join anyway:

πemail,d.name,m.name,changed_at( user_audit ⋈ departments d ⋈ users m )

Here we’re joining to the other tables anyway! So copying the names didn’t really save us anything.

2. Immutability is violated

  • If department name changes, we have three bad choices:
    • Update audit rows (violates immutability)
    • Leave them stale (misleading audit trail)
    • Store both old and current (now we need two columns, and which is “truth”?)

3. Can’t query “when did user change departments”

  • We need something like: σold_department_name ≠ new_department_name(user_audit)
  • But department_name is a string that could be:
    • “Engineering” vs “engineering” (case difference)
    • “Engineering” vs “Engineering & AI” (rename, not a move)
    • Can’t tell if it’s the same department or different!
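
A minimal sketch of the ID-based alternative (setting aside full temporal modeling, per the aside above): store stable identifiers and resolve names at read time.

-- Replacing the earlier definition: names are resolved via joins at read time
CREATE TABLE user_audit (
    audit_id INT PRIMARY KEY,
    user_id INT REFERENCES users(id),
    department_id INT REFERENCES departments(id),
    manager_id INT REFERENCES users(id),
    changed_at TIMESTAMP,
    changed_by INT REFERENCES users(id)
);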

Example 3: The Soft Delete Pattern

CREATE TABLE posts (
    id INT PRIMARY KEY,
    title VARCHAR(200),
    content TEXT,
    author_id INT,
    deleted_at TIMESTAMP  -- NULL = active, timestamp = deleted
);

CREATE TABLE comments (
    id INT PRIMARY KEY,
    post_id INT REFERENCES posts(id),
    user_id INT,
    text TEXT,
    deleted_at TIMESTAMP
);

The pitch: “We never actually delete data! It’s safer, we can undelete things, and we have a full audit trail.”

The Problems

Problem 1: Every query needs filtering

  • Want active posts? WHERE deleted_at IS NULL
  • Want active comments? WHERE deleted_at IS NULL
  • Forgot it once? Data leak.
  • Now every single query comes with this one WHERE clause

Problem 2: Unique constraints are broken

  • User email should be unique, but a soft-deleted user still has their email in the table
  • New user tries to register with the same email → constraint violation!
  • Workaround: a partial unique index that applies only WHERE deleted_at IS NULL (sketched below). But now uniqueness is conditional, which violates relational model principles
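
In Postgres, that workaround looks like this (a sketch):

-- Uniqueness is enforced only for rows that aren't soft-deleted:
CREATE UNIQUE INDEX users_email_active
    ON users (email)
    WHERE deleted_at IS NULL;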

Problem 3: Referential integrity doesn’t match application semantics

  • comments.post_id REFERENCES posts(id) allows references to deleted posts
  • Should comments on deleted posts be visible? Hidden? Cascade-deleted?
  • The database says “this reference is valid” but your application has to enforce additional rules
  • You’ve pushed referential integrity logic into application code

Problem 4: Performance degradation

  • Indexes include deleted rows (never queried, just bloat); partial indexes help, as sketched below
  • Every query scans past thousands of deleted rows it will never return
  • Full table scans get slower over time
  • You’re essentially archiving data in the hot table
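
The partial-index mitigation (Postgres, a sketch), though note that every index now needs the same predicate baked in:

-- Index only live rows, so deleted rows stop bloating it:
CREATE INDEX idx_posts_active_author
    ON posts (author_id)
    WHERE deleted_at IS NULL;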

What Relational Algebra Reveals

Query: Show all active posts

σdeleted_at IS NULL(Posts)

Problems:

1. Selection predicate required on EVERY query

  • Forget it once → data leak
  • Can’t enforce this at schema level
  • It’s a runtime check, not a design constraint

2. Join ambiguity

Posts ⋈ Comments

Should this return:

  • Active posts with active comments?
  • Active posts with all comments (including deleted)?
  • All posts with active comments?

The schema doesn’t say! It’s ambiguous!

3. Uniqueness constraint can’t be expressed

∀u1,u2 ∈ σdeleted_at IS NULL(Users): u1 ≠ u2 → u1.email ≠ u2.email

This is a CONDITIONAL constraint (only on subset where deleted_at IS NULL). Relational model assumes constraints apply to entire relation. We’ve violated the model.

What relational algebra tells us:

If we express what we actually need:

Active posts: πtitle,content(Posts)
Archived posts: πtitle,content,deleted_at(ArchivedPosts)
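
One way to realize that split (a sketch: the WITH … DELETE … RETURNING move is Postgres syntax, and the id 42 is just a placeholder):

-- Live posts and archived posts are separate relations;
-- uniqueness and foreign keys on the hot table stay unconditional.
CREATE TABLE archived_posts (
    id INT PRIMARY KEY,
    title VARCHAR(200),
    content TEXT,
    author_id INT,
    deleted_at TIMESTAMP NOT NULL
);

-- "Deleting" a post moves it out of the hot table:
WITH moved AS (
    DELETE FROM posts WHERE id = 42
    RETURNING id, title, content, author_id
)
INSERT INTO archived_posts (id, title, content, author_id, deleted_at)
SELECT id, title, content, author_id, NOW() FROM moved;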


I think we’ve seen enough examples of how easy it is to mess up common decisions.

A Methodology to Prevent These Mistakes

Before jumping in and creating tables, let’s follow these steps:

  1. Gather requirements (queries and constraints)
  2. Express the queries in relational algebra for a clearer picture
  3. Identify relations and attributes from relational algebra expressions
  4. Apply normalization rules
  5. Validate with relational algebra (check if requirements can be expressed cleanly)
  6. Implement in SQL

Let’s walk through the first example, the products table, to see how this works in practice.

Step 1: Gather Requirements

Queries

  • Find all products by category
  • Find all red boots in size 10
  • List all available colors
  • Count products per category

Constraints

  • Every product must have a category
  • Every boot must have a size
  • Color must be from a pre-defined set of colors

Step 2: Express in Relational Algebra

Query 1: Find all products by category

σcategory=‘boots’(Products)

  • Need Products relation (obviously)
  • Need category as an attribute—can’t use JSONB
  • Need index on category

Query 2: Find all red boots in size 10

σcolor=‘red’ ∧ size=‘10’(Boots)

Notice: No category='boots' predicate needed. The fact that we’re querying the Boots relation is the category filter. This is a signal from relational algebra: if category is always the same for a subset of products, that subset should be its own relation.

  • Need color and size attributes
  • Multiple selection predicates—can consider a compound index

Query 3: List all available colors

πcolor(Products)

  • Need Colors as a separate relation
  • If color were simply an attribute on Products, listing colors would mean a DISTINCT scan over the whole table, which is comparatively slower (see the sketch below)
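
The difference in practice (a sketch):

-- With a Colors relation: a scan over a tiny reference table
SELECT name FROM colors;

-- With color as a plain attribute: a DISTINCT over the entire products table
SELECT DISTINCT color FROM products;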

Query 4: Count products per category

Γcategory; COUNT(*)→count(Products)

  • Need category as a groupable attribute
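
Jumping ahead to the schema we’ll identify below, this is a plain GROUP BY in SQL:

-- Count products per category, resolving category names via a join:
SELECT c.name, COUNT(*) AS count
FROM products p
JOIN categories c ON p.category_id = c.id
GROUP BY c.name;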

Constraint Check

∀p ∈ σcategory=‘boots’(Products): p.size ≠ NULL

Can we express this in the schema?

  • If size is in Products table: Need conditional NOT NULL (impossible!)

Conclusion: Boots should be a separate relation.

Step 3: Identify Relations and Attributes

Attributes used in selection must be columns

σcolor=‘red’ ∧ size=‘10’

Attributes needed: color, size

Attributes used in projection must be columns

πcolor(Products)

Attributes needed: color

Joins indicate foreign key relationships

Products ⋈ Boots

Foreign key needed: Boots.product_id REFERENCES Products.id

Constraints dictate relation structure

∀p ∈ σcategory=‘boots’(Products): p.size ≠ NULL

Decision: Boots should be a separate relation. This allows size column to have a NOT NULL constraint.

Identified Relations

1. Products(id, name, category_id)

  • Primary key: id
  • Foreign key: category_id → Categories.id

2. Categories(id, name)

  • Primary key: id
  • Unique: name

3. Boots(product_id, color_id, size_id, material_id)

  • Primary key: product_id
  • Foreign keys:
    • product_id → Products.id
    • color_id → Colors.id
    • size_id → Sizes.id
    • material_id → Materials.id

4. Colors(id, name, hex_code)

  • Primary key: id
  • Unique: name

5. Sizes(id, name, sort_order)

  • Primary key: id
  • Unique: name

6. Materials(id, name)

  • Primary key: id
  • Unique: name

Step 4: Apply Normalization Rules

Let’s say you have:

CREATE TABLE order_items (
    order_id INT,
    product_id INT,
    customer_name VARCHAR(100),
    product_name VARCHAR(100),
    quantity INT,
    PRIMARY KEY (order_id, product_id)
);

Try expressing this business rule in RA:

∀oi1, oi2 ∈ order_items:  
 oi1.order_id = oi2.order_id → oi1.customer_name = oi2.customer_name

“All items in the same order must have the same customer name.”

Can this be enforced in SQL? No. There’s no CHECK constraint that can span rows.

Conclusion: You have a functional dependency (order_id → customer_name) that’s not captured by your primary key. This violates 2NF.

The fix:

  • Split into orders(order_id, customer_name) and order_items(order_id, product_id, quantity), as sketched below
  • Now the functional dependency is enforced by the primary key of orders
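
A sketch of the fix:

-- The functional dependency order_id → customer_name now lives in orders:
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    customer_name VARCHAR(100) NOT NULL
);

CREATE TABLE order_items (
    order_id INT REFERENCES orders(order_id),
    product_id INT,
    quantity INT,
    PRIMARY KEY (order_id, product_id)
);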

The pattern: If a constraint can be expressed in relational algebra but can’t be enforced in SQL, it’s probably a normalization issue.

Step 5: Validate with Relational Algebra

Rewrite the expressions from earlier against the actual schema relations and check whether they come out natural and efficient.

Expression from step 2:

σcolor=‘red’ ∧ size=‘10’(Boots)

Expression with actual schema relations:

πp.name, p.id(
  σc.name=‘red’ ∧ s.name=‘10’(
  Products p
  ⋈(p.id = b.product_id) Boots b
  ⋈(b.color_id = c.id) Colors c
  ⋈(b.size_id = s.id) Sizes s
 )
)
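
As SQL, the same expression reads naturally (a sketch against the schema we just identified):

SELECT p.name, p.id
FROM products p
JOIN boots b ON p.id = b.product_id
JOIN colors c ON b.color_id = c.id
JOIN sizes s ON b.size_id = s.id
WHERE c.name = 'red' AND s.name = '10';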

Notice what’s missing: No category='boots' filter. We’re querying Boots directly; its existence as a relation is the category filter. This is exactly what relational algebra tells us: repeated selection predicates on the same value suggest the need for a separate relation.

Validation Questions

  • Are all operations standard RA? Yes (σ, π, ⋈)
  • Are joins natural? Yes (all on foreign key integer columns)
  • Can database optimize this? Yes
    • Can use indexes on color_id, size_id
    • Can use index on category_id
    • Query planner has statistics on all columns
    • Can reorder joins for efficiency
  • Does it handle edge cases?
    • What if product has no color? Inner join excludes it (correct!)
    • What if color doesn’t exist? FK prevents invalid color_id (correct!)
    • What if size is NULL? NOT NULL constraint prevents it (correct!)

Conclusion: Schema validated!

Bonus: From Relational Algebra to Indexes

Here’s something cool: RA expressions map directly to index strategy.

When you see:

σcolor=‘red’ ∧ size=‘10’(Boots)

The ∧ (AND) tells you these filters are evaluated together. In SQL:

WHERE color_id = X AND size_id = Y

Index strategy from RA:

  • Single predicate → Single column index
  • Multiple predicates with ∧ → Compound index
  • Predicates with ∨ (OR) → Separate indexes (or bitmap scan)

For our boots query:

CREATE INDEX idx_boots_color_size ON boots(color_id, size_id);

Why this order? The RA expression doesn’t tell you… but query selectivity does. For example, if color filters out 90% of rows and size filters out 50%, put color first. The database can use the index for:

  • Just color: ✓
  • Color + size: ✓
  • Just size: ✗ (needs separate index)

Rule of thumb: Order index columns from most selective to least selective, and match the order you’ll most commonly query them.

Using Validation to Identify Bad Schema

Let’s take an example, going back to the JSONB schema from earlier:

CREATE TABLE products (
  id INT PRIMARY KEY,
  name VARCHAR(255),
  attributes JSONB
);

Trying to express the same query in relational algebra:

σattributes->>‘category’=‘boots’ ∧ attributes->>‘color’=‘red’ ∧ attributes->>‘size’=‘10’(Products)

Validation Questions:

  • Are all operations standard RA? No (->> is a function call, not an attribute reference)
  • Can database optimize this? No; the database can’t use regular indexes on JSON keys
  • Does it handle edge cases?
    • There’s no type safety; we can’t enforce that color values come from a set of predefined colors
    • No constraint enforcement; we can’t ensure that boots have a size or material

Conclusion: Schema failed validation.

Step 6: Implementing in SQL (Finally!)

-- Core entities
CREATE TABLE categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) UNIQUE NOT NULL
);

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    category_id INT NOT NULL REFERENCES categories(id),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Define the schema for the attributes
CREATE TABLE colors (
    id SERIAL PRIMARY KEY,
    name VARCHAR(50) UNIQUE NOT NULL,
    hex_code CHAR(7) NOT NULL
);

CREATE TABLE sizes (
    id SERIAL PRIMARY KEY,
    name VARCHAR(10) UNIQUE NOT NULL,
    sort_order INT NOT NULL
);

CREATE TABLE materials (
    id SERIAL PRIMARY KEY,
    name VARCHAR(50) UNIQUE NOT NULL
);

-- Product category-specific attributes
CREATE TABLE shirts (
    product_id INT PRIMARY KEY REFERENCES products(id) ON DELETE CASCADE,
    color_id INT NOT NULL REFERENCES colors(id),
    size_id INT NOT NULL REFERENCES sizes(id),
    material_id INT REFERENCES materials(id)
);

CREATE TABLE boots (
    product_id INT PRIMARY KEY REFERENCES products(id) ON DELETE CASCADE,
    color_id INT NOT NULL REFERENCES colors(id),
    size_id INT NOT NULL REFERENCES sizes(id),
    width VARCHAR(10) CHECK (width IN ('Narrow', 'Medium', 'Wide'))
);

-- Indexes for common queries (from RA analysis)
CREATE INDEX idx_products_category ON products(category_id);
CREATE INDEX idx_boots_color_size ON boots(color_id, size_id);

Conclusion

Now maybe in some cases (or even this one) there are better ways to design the schema. But after going through this admittedly tedious process, it’s very hard to end up with a truly bad one, and I think that’s a huge win.

Was it fun? No.
Was it as rewarding as I expected it to be? Also no.

But it worked. It removed guesswork, exposed bad assumptions, and left me with a database design I actually trust.