Scaling the Outbox Pattern with PostgreSQL - Part 1: Escaping the Dual-Write Dilemma
Welcome to the first part of a three-part series on building and scaling the Transactional Outbox Pattern using PostgreSQL. In this series, we will start from the foundational concepts and progressively tackle the challenges of concurrency, scaling, and real-time event processing.
If you are building microservices or event-driven systems, you have probably run into this problem: How do you save state to your database and publish an event to a message broker at the same time without losing data? Let’s dive in.
The Dual-Write Problem
Imagine a simple user registration flow. When a user signs up, you need to do two things:
- Save the new user to your PostgreSQL database.
- Publish a
UserCreatedevent to a message broker (like RabbitMQ or Kafka) so other services can send a welcome email or provision resources.
// The naive approach - DANGEROUS!
async function registerUser(userData: UserData) {
// 1. Save to database
const user = await db.query(
'INSERT INTO users (email, name) VALUES ($1, $2) RETURNING *',
[userData.email, userData.name]
)
// 2. Publish event to message broker
await messageBroker.publish('UserCreated', user)
return user
}
This is called a “dual-write”. And it’s a trap.
What happens if step 1 succeeds, but step 2 fails because the message broker is temporarily down? You now have a user in your database, but no other service knows about it. No welcome email, no provisioned resources. Inconsistent state.
If you swap the order, you risk publishing the event but failing to save the user to the database. Same problem, different flavor.
The Solution: The Transactional Outbox Pattern
The Outbox Pattern solves this by using a single database transaction to guarantee that both the domain data and the event are saved. Instead of directly publishing to the message broker, you save the event to an “outbox” table in the exact same transaction.
CREATE TABLE outbox_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
aggregate_type VARCHAR(255) NOT NULL,
aggregate_id VARCHAR(255) NOT NULL,
event_type VARCHAR(255) NOT NULL,
payload JSONB NOT NULL,
status VARCHAR(50) DEFAULT 'PENDING',
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
Now, your registration logic looks like this:
async function registerUser(userData: UserData) {
const client = await pool.connect()
try {
await client.query('BEGIN')
// 1. Save to database
const userResult = await client.query(
'INSERT INTO users (email, name) VALUES ($1, $2) RETURNING *',
[userData.email, userData.name]
)
const user = userResult.rows[0]
// 2. Save event to outbox table IN THE SAME TRANSACTION
await client.query(
`INSERT INTO outbox_events
(aggregate_type, aggregate_id, event_type, payload)
VALUES ($1, $2, $3, $4)`,
['User', user.id, 'UserCreated', JSON.stringify(user)]
)
await client.query('COMMIT')
return user
} catch (error) {
await client.query('ROLLBACK')
throw error
} finally {
client.release()
}
}
If the database transaction commits, you are 100% guaranteed that the event is sitting safely in the outbox_events table waiting to be processed.
The Naive Poller
Great, the event is saved. But how does it get to the message broker?
You build a separate background worker (the “Relay” or “Poller”) that constantly queries the outbox_events table for pending events, publishes them, and marks them as done.
async function startNaivePoller() {
setInterval(async () => {
const events = await db.query(
"SELECT * FROM outbox_events WHERE status = 'PENDING' ORDER BY created_at ASC LIMIT 50"
)
for (const event of events.rows) {
try {
// Publish to broker
await messageBroker.publish(event.event_type, event.payload)
// Mark as published
await db.query(
"UPDATE outbox_events SET status = 'PUBLISHED' WHERE id = $1",
[event.id]
)
} catch (err) {
console.error(`Failed to process event ${event.id}`, err)
}
}
}, 1000) // Poll every second
}
The Bottleneck
This basic setup works perfectly for a simple application running on a single instance. But what happens when your application grows and you need to scale horizontally?
You spin up three instances of your service, and suddenly, you have three pollers running simultaneously.
- Lock Contention: All three pollers fire the
SELECT * FROM outbox_eventsquery at the exact same time. - Duplicate Processing: Because they don’t lock the rows they are selecting, all three instances grab the exact same 50 pending events.
- The Result: The same
UserCreatedevent gets published to your message broker three times. Your users receive three welcome emails.
To solve this, we need to lock the rows we are reading so other instances know they are being worked on. But if we just use a standard SELECT ... FOR UPDATE, our workers will block each other, killing our throughput.
In Part 2, we will explore how to solve this concurrency nightmare using PostgreSQL’s powerful row-level locking capabilities, specifically SKIP LOCKED, and when to pull out the big guns with Advisory Locks.
Stay tuned, and savor every step of your coding journey.
