index

Finishing what we started: Audit trails, soft deletes and time

Last time, I built a methodology and then immediately used it on only one of three examples. If you noticed I left the audit trail and soft delete examples hanging with a diagnosis and no treatment. Partly because the post was already getting long, and partly because the audit trail problem led me somewhere I wasn’t ready to write about yet: temporal modeling.

I’ve since gone down that rabbit hole. It’s deeper than I expected, and honestly a bit humbling. Turns out “preserve what things looked like at a point in time” is a problem the database community has been formally thinking about for decades and people keep reinventing broken versions of the solution.

So this post finishes what the last one started. Both examples, all six steps, no punting.

Revisiting Example 2: The denormalized audit trail

The culprit:

CREATE TABLE user_audit (
    audit_id INT PRIMARY KEY,
    user_id INT,
    email VARCHAR(100),
    department_name VARCHAR(100),  -- Copied from departments table
    manager_name VARCHAR(100),     -- Copied from users table
    changed_at TIMESTAMP,
    changed_by INT
);

Initial approach was we keep exactly what things looked like in the moment of chage thus showing the accurate values at that particular point, like maintaining a history.

Revisiting the problems

Problem 1: The audit trail becomes misleading

At some point in time the department_name changed and thus changing the results from the query and also creates confusion while going through the audit trail.

The solution (sort of):

Start referencing department names by identity instead of simply copying them. What I mean by that is, we store department_id and put the history of department names in a separate relation. When we’ll have to resolve “what was this department called when this audit row was written” we can simply look up the version of the name that valid at changed_at.

The two terms that made everything click for me:

  • Valid time — when the fact was true in the real world. (“Engineering became Engineering & AI on March 1st.”)
  • Transaction time — when the database knew about it. (“We recorded the rename on March 5th, four days late.”)

A relation that tracks both is called bitemporal.

This is some extra information Google (basically Gemini) gave me:

SQL:2011 actually standardized “system-versioned tables” for this, but PostgreSQL and MySQL still don’t really support it natively, so most people roll their own with a history table.

For an audit trail you usually only need valid time — you find out about a rename more or less the moment it happens — so I’m going to stick with that for the rest of the post.

This single change also takes care of Problem 2 (“the immutable audit log needed updates”) and Problem 3 (“simple queries became impossible”), because both of them are downstream of the same mistake. Renaming a department is now an insert into the history table, not an UPDATE that touches a thousand audit rows. And queries like “everyone ever in Engineering” stop being string-matching nightmares because they go through a reliable unique ID which can then be indexed.

OK. Solution stated. Now let me actually walk through the methodology, because I tried to skip ahead the first time and got the design wrong twice.

Step 1: Gather Requirements

Queries

  • Show me Bob’s history of department changes
  • Show me everyone who was in the Engineering department on Jan 1, 2024
  • Show me what the audit log looked like as of last Friday
  • Show me who renamed the Engineering department, and when

Constraints

  • Audit entries are immutable. Once written, never modified.
  • Renaming a department must NOT alter what old audit rows display.
  • Every department referenced by an audit row must have existed at the time of the change (not necessarily under the same name).

Step 2: Express in Relational Algebra

Query 1: Bob’s department history

σuser_id=bob.id(UserAudit)

Plain selection. Nothing temporal yet, this one is easy.

Query 2: Everyone in Engineering on Jan 1, 2024

This was something that made my pause Spotify and really think for once. “Engineering on Jan 1, 2024” is itself a temporal lookup — first I have to figure out what was the department named “Engineering” on that date, then ask who was in it.

σname=‘Engineering’ ∧ valid_from ≤ ‘2024-01-01’ ∧ (valid_to > ‘2024-01-01’ ∨ valid_to IS NULL)(DepartmentHistory)

(This basically means, find the DepartmentHistory records where the name is “Engineering” and that record has started being valid on or before “1st Jan 2024” and either doesn’t expire until after 1st Jan 2024 or doesn’t expire at all)

Then a join with UserAudit anchored to that same point in time. Two temporal selections and a join. Not bad once you write it down, but I definitely wrote it three different wrong ways first.

Query 3: Who renamed Engineering, and when?

σold.name=‘Engineering’ ∧ new.name≠‘Engineering’ ∧ old.department_id = new.department_id(DepartmentHistory old ⋈ DepartmentHistory new)

This is basically a self-join over the history of one entity. Once I had this figured out I noticed it’s the same shape as “who changed users’ departments”. They’re both change detection over a history relation.

Constraint check

aUserAudit: there exists a row in DepartmentHistory where dh.department_id = a.department_id AND a.changed_at falls inside [valid_from, valid_to).

This is the part that still confuses me. It’s a real constraint and it can’t be expressed as simply as saying department.id has to exist. I mean, technically this can be expressed but it won’t be accurate as we also need to verify that correct historical version of the department existed at the time the audit record was created. This can’t be enforced natively through SQL and needs some custom implementation.

Step 3: Identify Relations and Attributes

The audit row no longer carries the string department_name. It carries department_id (the stable identity) and changed_at (the moment we anchor time to). Names live on a separate history relation.

1. Departments(id) — just the identity.

2. DepartmentHistory(id, department_id, name, valid_from, valid_to)

  • One row per “version” of a department’s name
  • valid_to NULL means “current version”
  • Foreign key: department_id → Departments.id

3. UserAudit(audit_id, user_id, department_id, changed_by, changed_at)

  • department_id, NOT department_name
  • changed_at is the temporal anchor used to resolve which name was valid

Step 4: Apply Normalization Rules

The functional dependency (department_id, valid_from) → name belongs in DepartmentHistory. It does not belong in Departments, and it definitely does not belong duplicated into UserAudit. The key thing to notice here is to notice that name is not a property of Departments anymore.

I had an intrusive thought and tried to sketch that plan out to see how that looked - having the “current version” of department name in Departments and storing the older versions in DepartmentHistory, which I realised pretty soon is getting unnecessarily complicated, having a single source of truth is ideal in this scenario.

Step 5: Validate with Relational Algebra

Re-express Query 2 against the actual relations:

πu.email(
  σdh.name=‘Engineering’ ∧ dh.valid_from ≤ ‘2024-01-01’ ∧ (dh.valid_to > ‘2024-01-01’ ∨ dh.valid_to IS NULL)(DepartmentHistory dh)
  ⋈(dh.department_id = ua.department_id) UserAudit ua
  ⋈(ua.user_id = u.id) Users u
)

Validation Questions

  • Standard operations? Yes (σ, π, ⋈)
  • Indexable? Yes — a compound index on (department_id, valid_from, valid_to) makes the point-in-time lookup cheap
  • Edge cases:
    • Department renamed → old audit rows still resolve correctly via department_id joining the right history version. ✓
    • Department “deleted” → set valid_to rather than deleting the row. The id stays valid forever. ✓
    • Future-dated rename → drop a row with valid_from > NOW(), queries respect it automatically. ✓

The thing I find most satisfying: Problem 2 from earlier — “we had to update thousands of audit rows when departments merged” — is gone. A merge is one or two writes to DepartmentHistory. UserAudit is never touched.

Step 6: Implementing in SQL

CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE departments (
    id SERIAL PRIMARY KEY
);

CREATE TABLE department_history (
    id SERIAL PRIMARY KEY,
    department_id INT NOT NULL REFERENCES departments(id),
    name VARCHAR(100) NOT NULL,
    valid_from TIMESTAMPTZ NOT NULL,
    valid_to TIMESTAMPTZ,
    CHECK (valid_to IS NULL OR valid_to > valid_from),
    UNIQUE (department_id, valid_from),
    EXCLUDE USING gist (
        department_id WITH =,
        tstzrange(
            valid_from,
            COALESCE(valid_to, 'infinity'),
            '[)'
        ) WITH &&
    )
);

CREATE TABLE user_audit (
    audit_id SERIAL PRIMARY KEY,
    user_id INT NOT NULL REFERENCES users(id),
    department_id INT NOT NULL REFERENCES departments(id),
    changed_by INT NOT NULL REFERENCES users(id),
    changed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_dh_dept_validity
ON department_history(department_id, valid_from, valid_to);

CREATE INDEX idx_audit_user_time
ON user_audit(user_id, changed_at);

The EXCLUDE USING gist is something new that I learned while generating writing this query and honestly this is doing more work than it looks. It’s saying: “for any given department_id, no two history rows are allowed to have overlapping validity ranges.” That single constraint is what prevents “two names valid at once” — without it, I’d have written a trigger. I had no idea this existed before this post.

I get the feeling this is going to keep happening: half of “you need application logic for that” actually turns out to be a constraint I just didn’t know about.

Revisiting Example 3: The soft delete pattern

The culprit:

CREATE TABLE posts (
    id INT PRIMARY KEY,
    title VARCHAR(200),
    content TEXT,
    author_id INT,
    deleted_at TIMESTAMP  -- NULL = active, timestamp = deleted
);

Initial pitch was: “we never lose data, and we get an audit trail for free.” Going through the temporal modeling stuff above, this now looks much more obviously broken to me than it did the first time around. You don’t actually get an audit trail — deleted_at only tells you when something was deleted, not what changed before, and not by whom and you don’t know what it was earlier either. It’s the cheapest possible version of history, and it pays for that with a tax on every read query forever.

Revisiting the problems

Here’s what I didn’t see in part 1: soft delete is a temporal model in denial. deleted_at is just valid_to with a different name and one fewer column.

So instead of patching each of the four problems individually like I did before, the redesign is the same move as the audit trail: split active state from archived state, and make the partition something the schema enforces.

Step 1: Gather Requirements

Queries

  • Show me all active posts
  • Look up a deleted post by id (for moderators reviewing reports)
  • Restore a deleted post
  • Show all active comments on this active post
  • Count posts deleted in the last week (moderation metrics)

Constraints

  • A (author_id, slug) combination must be unique among active posts
  • Restoring a post must not collide with another active post that already took its slug
  • A comment’s lifetime is tied to its post’s lifetime — when a post is deleted, its comments are deleted too

Step 2: Express in Relational Algebra

Query 1: All active posts

ActivePosts

This is the move. Don’t write σdeleted_at IS NULL(Posts) — that is the bug. Write the relation you actually want, and let the schema make it real.

Query 4: Active comments on an active post

σpost_id=X(ActiveComments)

Notice what’s missing: any “deleted_at IS NULL” predicate, and any ambiguity about “which deleted_at?”. Both relations carry their own meaning.

Constraint check: unique active slugs per author

p1, p2 ∈ ActivePosts: p1.author_id = p2.author_id ∧ p1.slug = p2.slug → p1 = p2

This is a regular uniqueness constraint as long as ActivePosts is its own relation. The “conditional uniqueness” thing from part 1 only happened because we were trying to express it on a union of active and deleted rows pretending to be one relation.

Step 3: Identify Relations and Attributes

1. Posts(id, title, slug, content, author_id, created_at) - active posts only.

2. DeletedPosts(id, title, slug, content, author_id, created_at, deleted_at, deleted_by) — archive.

3. Comments(id, post_id, user_id, text, created_at) - post_id references Posts.

4. DeletedComments(id, post_id, user_id, text, created_at, deleted_at) - archive.

A subtle thing: Posts.id and DeletedPosts.id should never collide, because a post might move between them (deletion, restoration). I went with a single shared id sequence — easier to reason about than two independent ones, and it means an id permanently identifies “this post” regardless of which table it currently lives in.

Step 4: Apply Normalization Rules

I went down a small rabbit hole here. Should DeletedPosts.deleted_by be a foreign key to Users.id? What if the user who did the deleting later gets soft-deleted themselves? The FK would point to a row that’s no longer in Users.

This is the moment where one redesign starts trying to drag every other entity into temporal modeling with it. I think the right answer is to know when to stop. You decide up front which entities have full history (probably Users for say, compliance reasons… probably anything with regulatory implications), and which entities only need active/archive (most of the rest). If you don’t make that decision deliberately, you make it implicitly, query by query, with WHERE clauses — and that’s exactly the soft-delete pattern.

For this example I’ll keep it simple: deleted_by references a user id, and we accept that the user might exist in either Users or DeletedUsers. That’s a real tradeoff, not a free lunch.

Step 5: Validate with Relational Algebra

Query 4 (“active comments on an active post”) against the new schema:

πc.text, c.user_id, c.created_at(
  σp.id=X(Posts p ⋈(p.id = c.post_id) Comments c)
)

Validation Questions

  • Standard operations? Yes.
  • Indexable? Yes. Comments(post_id, created_at) covers it.
  • Edge cases:
    • Comment on a deleted post → not in Comments, won’t appear ✓
    • Active comment on a deleted post → can’t exist by construction. Comments.post_id REFERENCES Posts.id, and Posts only contains active posts. Soft-deleting a post requires moving its comments to DeletedComments at the same time — and the FK enforces that atomicity for us. ✓
    • Restoring a deleted post → move from DeletedPosts back to Posts, move related rows back from DeletedComments to Comments. The unique constraint on (author_id, slug) will reject the restoration if some other post took the slug while this one was deleted, which is the behavior we want. ✓

Compare to the original join:

σp.deleted_at IS NULL ∧ c.deleted_at IS NULL(Posts p ⋈ Comments c)

The new version is shorter, indexable and most importantly, not forgettable. There is no version of this query you can write that accidentally returns deleted data, because deleted data isn’t in those relations.

Step 6: Implementing in SQL

CREATE SEQUENCE post_id_seq;
CREATE SEQUENCE comment_id_seq;

CREATE TABLE posts (
    id BIGINT PRIMARY KEY DEFAULT nextval('post_id_seq'),
    title VARCHAR(200) NOT NULL,
    slug VARCHAR(200) NOT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL REFERENCES users(id),
    created_at TIMESTAMP NOT NULL DEFAULT NOW(),
    UNIQUE (author_id, slug)
);

CREATE TABLE deleted_posts (
    id BIGINT PRIMARY KEY,
    title VARCHAR(200) NOT NULL,
    slug VARCHAR(200) NOT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    deleted_at TIMESTAMP NOT NULL DEFAULT NOW(),
    deleted_by INT NOT NULL
);

CREATE TABLE comments (
    id BIGINT PRIMARY KEY DEFAULT nextval('comment_id_seq'),
    post_id BIGINT NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
    user_id INT NOT NULL REFERENCES users(id),
    text TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE TABLE deleted_comments (
    id BIGINT PRIMARY KEY,
    post_id BIGINT NOT NULL,
    user_id INT NOT NULL,
    text TEXT NOT NULL,
    created_at TIMESTAMP NOT NULL,
    deleted_at TIMESTAMP NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_comments_post ON comments(post_id, created_at);
CREATE INDEX idx_deleted_posts_deleted_at ON deleted_posts(deleted_at);

Deletion becomes a transaction: copy the post into deleted_posts, copy the related comments into deleted_comments, delete from posts (which cascades into comments). It’s more work at write time. Reads which are way more frequent in most apps get cleaner, faster, and stop depending on whether the developer remembered the convention.

A counter-argument I genuinely considered: “this is just soft-delete with extra steps”. I thought so too at first. The difference is where the partition lives. With deleted_at, the partition is in the head of every developer who writes a query. With separate relations, the partition is in the schema, and the database enforces it for free.

I think I’ll slowly realise what else I’ve missed here and continue this post later but I think I covered the examples I had missed in the previous.