
Anti double-send: make an emailing worker idempotent

An emailing worker that crashes mid-send is normal. What's not normal is a subscriber receiving the same campaign twice on restart. Here are the four mechanisms we stack in prod at Zylior so a recipient gets a message exactly once, even when the process dies between two batches.

The problem: "at least once" is not "exactly once"

A worker ticks every 60 s, claims a batch of 20 recipients, calls the send service, and marks the rows as `sent`. If the process dies after the network call but before the `commit`, the batch is still `queued` on reboot. With no guardrail, those 20 people get the email twice. Conversely, if you mark `sent` before the call and the call fails, those 20 get nothing. A single write cannot give you both guarantees: you have to spread them across several layers.
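To make the two failure modes concrete, here is a minimal in-memory simulation. It is only a sketch: `run_batch`, `simulate` and the `Crash` exception are hypothetical names, the "database" is a list of dicts, and the "send service" is a counter per inbox.

```python
from collections import Counter


class Crash(Exception):
    """Stands in for the process dying (hypothetical helper)."""


def run_batch(rows, inbox, mark_first, crash_between):
    """Run the two steps of a batch in the chosen order."""
    def mark():
        for row in rows:
            row["status"] = "sent"

    def send():
        for row in rows:
            inbox[row["email"]] += 1

    first, second = (mark, send) if mark_first else (send, mark)
    first()
    if crash_between:
        raise Crash()  # the process dies between the two steps
    second()


def simulate(mark_first):
    rows = [{"email": f"user{i}@example.com", "status": "queued"} for i in range(20)]
    inbox = Counter()
    try:
        run_batch(rows, inbox, mark_first, crash_between=True)
    except Crash:
        pass
    # Restart: the worker only picks up rows that are still queued.
    pending = [row for row in rows if row["status"] == "queued"]
    if pending:
        run_batch(pending, inbox, mark_first, crash_between=False)
    return inbox


mark_after_send = simulate(mark_first=False)   # every recipient is emailed twice
mark_before_send = simulate(mark_first=True)   # nobody is emailed at all
```

Neither ordering is safe on its own, which is why the rest of the article adds layers around it.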

Ground rule: after a crash, the worker must never treat work that's already done as if it were new. Idempotency means that replaying an operation has the same effect as executing it once. You get it through four points: deterministic identity, atomic transition, database uniqueness, batch lock.

1. A deterministic job_id per recipient

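The first mistake is generating a random job id (`uuid()`) on every attempt. On replay it's a new id, so the send service sees a new message and you double-send. The key: derive the id from the (campaign, subscriber) pair, never from a random value or a timestamp. Same input, same id, forever. On the send service side (BullMQ, SQS dedup, or your homemade queue), this `job_id` acts as the deduplication key: pushing `campaign:cmp_42:sub_7` twice keeps only one, as long as the queue still remembers the first push. SQS FIFO queues deduplicate within a 5-minute window, and BullMQ ignores a duplicate `jobId` only while the earlier job has not been removed, so size the retention to cover your worst-case restart delay. Within that window, the replay is harmless by construction.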

-- job_id = stable identity of the (campaign, recipient) pair
insert into growth_sends(campaign_id, account_id, subscriber_id, email, job_id)
select distinct on (s.email_lower)
       $1, $2, s.id, s.email,
       'campaign:' || $1 || ':' || s.id   -- deterministic, no uuid()
  from growth_subscribers s
 where s.account_id = $2 and s.status = 'confirmed'
 order by s.email_lower, s.created_at
on conflict (campaign_id, subscriber_id) do nothing;
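On the queue side, the effect of a deterministic id can be sketched as follows. `DedupQueue` is a hypothetical in-memory stand-in for whatever deduplicating queue you use (it is not the BullMQ or SQS API), and `make_job_id` simply mirrors the string built in the SQL above.

```python
import uuid


def make_job_id(campaign_id: str, subscriber_id: str) -> str:
    """Same (campaign, subscriber) pair in, same id out, on every attempt."""
    return f"campaign:{campaign_id}:{subscriber_id}"


class DedupQueue:
    """Toy queue that keeps at most one job per job_id (assumption, not a real API)."""

    def __init__(self):
        self._jobs = {}

    def push(self, job_id, payload) -> bool:
        if job_id in self._jobs:
            return False  # duplicate: silently ignored
        self._jobs[job_id] = payload
        return True

    def __len__(self):
        return len(self._jobs)


payload = {"email": "a@example.com"}

# Deterministic id: the replay after a crash is a no-op.
queue = DedupQueue()
first_push = queue.push(make_job_id("cmp_42", "sub_7"), payload)
replayed_push = queue.push(make_job_id("cmp_42", "sub_7"), payload)

# Random id: the replay looks like a brand-new message.
random_queue = DedupQueue()
random_queue.push(str(uuid.uuid4()), payload)
random_queue.push(str(uuid.uuid4()), payload)
```

With the deterministic id the queue holds one job after two pushes; with `uuid4()` it holds two, which is the double-send.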

2. Atomic status transition (compare-and-set)

Two concurrent ticks (or two replicas of the worker) can see the same `approved` campaign at the same instant. If each one moves it to `sending` and starts the send, you double everything. Compare-and-set fixes this: a single `UPDATE … WHERE status='approved'` wins, the other sees `rowCount = 0` and stops. Postgres serializes the write on the row: under the default READ COMMITTED isolation, the second `UPDATE` waits for the first to commit, re-checks its `WHERE` against the updated row, finds `status='sending'`, and matches nothing. No application-level lock needed. Never do a `SELECT status` then a separate `UPDATE`: between the two, another worker slips through. The state condition has to live inside the `WHERE` of the same `UPDATE`, so that read and write happen in a single atomic operation.

-- CAS: a single worker flips the campaign. The others see rowCount=0.
update growth_campaigns
   set status='sending', updated_at=now()
 where status='approved'              -- guardrail: the expected state
   and scheduled_for is not null
   and scheduled_for <= now()
returning id;
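From the worker's side, the check is just "did my `UPDATE` touch a row?". The sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs anywhere; the two attempts are sequential on one connection, so it shows the row-count logic, not real concurrency. `try_start` and the reduced table layout are assumptions for illustration, and the `id = ?` filter is added here to target one campaign.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table growth_campaigns (id integer primary key, status text not null)"
)
conn.execute("insert into growth_campaigns (id, status) values (1, 'approved')")
conn.commit()


def try_start(conn, campaign_id):
    """Return True only for the caller whose UPDATE actually flipped the row."""
    cur = conn.execute(
        "update growth_campaigns set status = 'sending' "
        "where id = ? and status = 'approved'",  # expected state lives in the WHERE
        (campaign_id,),
    )
    conn.commit()
    return cur.rowcount == 1


worker_a_won = try_start(conn, 1)   # flips approved -> sending
worker_b_won = try_start(conn, 1)   # matches nothing: status is already 'sending'

final_status = conn.execute(
    "select status from growth_campaigns where id = 1"
).fetchone()[0]
```

The losing worker does nothing further: a zero row count is its signal that someone else owns the campaign.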

3. Uniqueness (campaign, subscriber): the safety net in the database

The first two layers can still give way to a bug. The uniqueness constraint does not: with a unique index on `(campaign_id, subscriber_id)`, the database refuses a second `growth_sends` row for the same pair, period. You build the recipient list with an `INSERT … ON CONFLICT DO NOTHING`, so re-running the build after a crash creates no duplicate, and the worker picks up exactly where it stopped by reading the rows still `queued`.
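A small runnable sketch of that safety net, again with `sqlite3` standing in for Postgres (SQLite accepts the same `on conflict … do nothing` clause from version 3.24 on). The table is a reduced, assumed version of `growth_sends`, and `build_recipients` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table growth_sends (
    id            integer primary key,
    campaign_id   text not null,
    subscriber_id text not null,
    email         text not null,
    job_id        text not null,
    status        text not null default 'queued',
    unique (campaign_id, subscriber_id)   -- the safety net
);
""")


def build_recipients(conn, campaign_id, subscribers):
    """Insert one row per subscriber; rows that already exist are left untouched."""
    conn.executemany(
        "insert into growth_sends (campaign_id, subscriber_id, email, job_id) "
        "values (?, ?, ?, ?) "
        "on conflict (campaign_id, subscriber_id) do nothing",
        [
            (campaign_id, sub_id, email, f"campaign:{campaign_id}:{sub_id}")
            for sub_id, email in subscribers
        ],
    )
    conn.commit()


subscribers = [("sub_7", "a@example.com"), ("sub_8", "b@example.com")]
build_recipients(conn, "cmp_42", subscribers)

# One recipient is processed, then the worker crashes and the build is replayed.
conn.execute("update growth_sends set status = 'sent' where subscriber_id = 'sub_7'")
conn.commit()
build_recipients(conn, "cmp_42", subscribers)

rows = conn.execute(
    "select subscriber_id, status from growth_sends order by subscriber_id"
).fetchall()
```

The replay adds no row and does not reset `sub_7` to `queued`: `rows` still holds one `sent` row and one `queued` row.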

4. Batch lock: FOR UPDATE SKIP LOCKED

To parallelize without stepping on each other, each worker claims a batch of `queued` rows by locking them. `FOR UPDATE` places the lock; `SKIP LOCKED` says "ignore the rows already taken by another worker and move on to the next ones." No waiting, no deadlock, no two workers on the same recipient. It's the standard way to build a queue on top of Postgres, available since version 9.5.

begin;
select id, subscriber_id, email, job_id
  from growth_sends
 where campaign_id = $1 and status='queued'
 order by id
 limit 20
 for update skip locked;     -- each worker takes a DISJOINT batch

-- mark 'sent' BEFORE the network call, in the same transaction
-- ($2 = array of the ids returned by the select above):
update growth_sends set status='sent', updated_at=now()
 where id = any($2);
commit;
-- only then: sendBulk(batch).  Crash here => the job_id makes the retry safe.

The subtle point: you mark `sent` before the network call, inside the transaction. Counterintuitive, but it's the deterministic `job_id` that makes this choice correct: when a batch is pushed again after a crash between the `commit` and the send, it carries the same `job_id`, and the queue deduplicates it. That re-push has to exist, though. Rows already marked `sent` are skipped by the normal claim query, so without a path that replays unconfirmed batches on restart, a crash at that point falls into the "those 20 get nothing" case from the opening section. Marking `sent` after the send, on the other hand, would have you resend a whole batch on the slightest post-delivery crash. If `sendBulk` fails outright, you put the rows back to `queued` and pause the campaign with a reason, so nothing is lost silently.
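The failure path (rows back to `queued`, campaign paused with a reason) can be sketched in memory as below. This is an illustration under assumptions, not our worker: `process_batch`, `pause_reason`, and the two fake senders are hypothetical, the rows are plain dicts instead of a locked Postgres batch, and the resume step stands for an operator un-pausing the campaign.

```python
def process_batch(rows, campaign, send_bulk, batch_size=20):
    """Claim up to batch_size queued rows, mark them sent, then push them."""
    if campaign["status"] != "sending":
        return 0
    batch = [row for row in rows if row["status"] == "queued"][:batch_size]
    if not batch:
        return 0
    for row in batch:
        row["status"] = "sent"  # stands in for the UPDATE + commit
    try:
        send_bulk([(row["job_id"], row["email"]) for row in batch])
    except Exception as exc:
        # Outright failure: nothing is lost, nothing is retried blindly.
        for row in batch:
            row["status"] = "queued"
        campaign["status"] = "paused"
        campaign["pause_reason"] = f"sendBulk failed: {exc}"
        return 0
    return len(batch)


rows = [
    {"job_id": f"campaign:cmp_42:sub_{i}", "email": f"user{i}@example.com", "status": "queued"}
    for i in range(50)
]
campaign = {"id": "cmp_42", "status": "sending", "pause_reason": None}
delivered = {}  # job_id -> email, deduplicated like the queue would


def failing_send(batch):
    raise ConnectionError("send service unreachable")


def working_send(batch):
    for job_id, email in batch:
        delivered.setdefault(job_id, email)


sent_on_failure = process_batch(rows, campaign, failing_send)
status_after_failure = campaign["status"]
queued_after_failure = sum(row["status"] == "queued" for row in rows)

campaign["status"] = "sending"  # operator resumes the campaign
while process_batch(rows, campaign, working_send):
    pass
```

After the failed attempt all 50 rows are `queued` again and the campaign carries a reason; after the resume, the loop drains them in batches of 20, 20 and 10.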

At scale, these four layers turn a fragile worker into a resumable executor: you can `kill -9` it in the middle of a 50,000-recipient campaign and restart it — it resumes the `queued` rows, ignores the ones already `sent`, respects the opt-outs that arrived in the meantime, and nobody gets it twice. No single layer is enough on its own: CAS protects the transition, uniqueness protects the build, `SKIP LOCKED` protects concurrency, and the `job_id` protects the replay at the end of the chain. Start with the uniqueness constraint in the database — it's the cheapest to put in place and the one that saves you when the other three have a bug.
