Implementing Celery for Async Work Order Batching: Resolving Duplicate Ingestion on Broker Failover

When routing parsed maintenance tickets into a CMMS, the move from synchronous validation to asynchronous execution introduces one sharp failure point: broker connection loss during task acknowledgment. Facilities managers and maintenance engineers repeatedly report duplicate preventive maintenance routes after a network blip or a Redis restart. The Celery worker finishes a batch of work orders but never gets to acknowledge it before the broker drops the connection, so on reconnect the broker redelivers the same payload and the async batch processing stage writes the records a second time. This page isolates the exact configuration flaw and gives a deterministic fix you can paste into a running pipeline.

Incident Profile

The symptom is a duplicate key integrity error in the ingestion worker, raised seconds after a broker reconnect rather than during steady-state load. During a failover event the Celery worker log shows a recognizable sequence: the task is received, marked succeeded locally, then received and run a second time once the connection recovers.

[2024-05-12 08:14:22,103: WARNING/MainProcess] Connection to Redis lost: Retry (0/20) now.
[2024-05-12 08:14:22,105: INFO/MainProcess] Task cmms_pipeline.batch_work_orders[8a3f-4c1d] received
[2024-05-12 08:14:22,410: INFO/MainProcess] Task cmms_pipeline.batch_work_orders[8a3f-4c1d] succeeded in 0.305s
[2024-05-12 08:14:23,001: WARNING/MainProcess] Connection to Redis lost: Retry (1/20) now.
[2024-05-12 08:14:25,112: INFO/MainProcess] Task cmms_pipeline.batch_work_orders[8a3f-4c1d] received
[2024-05-12 08:14:25,415: ERROR/MainProcess] IntegrityError: duplicate key value violates unique constraint "work_orders_pkey"

Two log lines give it away: the same task id (8a3f-4c1d) is received twice, and the database raises IntegrityError only on the second run. The task completed its work the first time, but the acknowledgment (ACK) never reached the broker because the connection was already gone. On recovery the broker still held the message in its unacknowledged set and redelivered it.
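The same check can be scripted against a saved worker log. The sketch below assumes the log format shown in the excerpt above; the function name redelivered_task_ids is illustrative and not part of Celery. A task retried through self.retry() is also received again under the same id, so treat a hit as a prompt to look at the surrounding lines rather than as proof of a broker redelivery.

```python
import re
from collections import Counter

# Matches the "received" lines in the worker log excerpt above.
RECEIVED = re.compile(r"Task (?P<name>[\w.]+)\[(?P<task_id>[^\]]+)\] received")


def redelivered_task_ids(log_lines):
    """Return task ids that appear in more than one 'received' line."""
    counts = Counter(
        match.group("task_id")
        for line in log_lines
        if (match := RECEIVED.search(line))
    )
    return sorted(task_id for task_id, seen in counts.items() if seen > 1)


sample = [
    "[2024-05-12 08:14:22,105: INFO/MainProcess] Task cmms_pipeline.batch_work_orders[8a3f-4c1d] received",
    "[2024-05-12 08:14:22,410: INFO/MainProcess] Task cmms_pipeline.batch_work_orders[8a3f-4c1d] succeeded in 0.305s",
    "[2024-05-12 08:14:25,112: INFO/MainProcess] Task cmms_pipeline.batch_work_orders[8a3f-4c1d] received",
]
print(redelivered_task_ids(sample))  # lists the id that was received twice
```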

Root Cause Analysis

Celery defaults to early acknowledgment (task_acks_late=False). With late acknowledgment disabled, the worker acknowledges a message just before the task function runs, so the ACK is tied to delivery rather than to a committed write. If the broker restarts or the connection partitions before that ACK reaches the broker, the message is still sitting in the broker’s unacknowledged set and is redelivered once the worker reconnects, even though the task went on to run and commit.

That redelivery is not a bug to be eliminated; in any distributed broker it is a guarantee you design around. The real defect here is that the ingestion task is not idempotent. The CMMS database has no native message-id tracking, so it cannot tell a redelivered batch apart from a genuinely new one, and a naive INSERT raises IntegrityError (or, worse, silently duplicates rows if the table has no unique constraint). Broker-level deduplication alone cannot close this gap — the contract has to be enforced at the boundary where rows are written.

The fix therefore has two halves that must ship together: defer the ACK until the work is durably committed (so a crash before commit re-runs the task, which is what we want), and make the write itself idempotent (so a redelivery after commit is a no-op). Both halves rely on the same canonical work order contract used across the Work Order Ingestion & Parsing Pipelines domain, so the SLA fields stay intact from intake through dispatch.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List, Optional


class Priority(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    STANDARD = "standard"
    PLANNED = "planned"


@dataclass
class WorkOrderPayload:
    """Canonical CMMS work order — SLA fields are mandatory site-wide."""
    work_order_id: str
    asset_id: str
    part_skus: List[str]
    required_quantities: Dict[str, int]
    priority: Priority = Priority.STANDARD
    requested_completion: Optional[datetime] = None
    escalation_tier: int = 0
    status: str = "open"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
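The two crash windows named above (before commit, and after commit but before the ACK) can be walked through with a toy model. This is a minimal sketch, not Celery's or Redis's real behavior: ToyBroker and simulate are hypothetical names, the "table" is a list, and the unique constraint is emulated with a set.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBroker:
    """Hypothetical in-memory stand-in for a broker's unacknowledged set."""
    unacked: dict = field(default_factory=dict)

    def deliver(self, msg_id, payload):
        self.unacked[msg_id] = payload

    def pending(self):
        # Everything not yet acknowledged is handed to the worker (again).
        return list(self.unacked.items())

    def ack(self, msg_id):
        self.unacked.pop(msg_id, None)


def simulate(crash_point, idempotent):
    """Run one message with late ACK, crashing once at `crash_point`.

    crash_point is "before_commit", "after_commit", or None.
    Returns the rows that ended up in the toy table.
    """
    broker, table, keys = ToyBroker(), [], set()

    def handle(row, crash_at):
        if crash_at == "before_commit":
            return False                      # nothing written, no ACK
        if not (idempotent and row in keys):  # emulates the unique constraint
            table.append(row)                 # the "commit"
            keys.add(row)
        return crash_at != "after_commit"     # ACK only after a clean return

    broker.deliver("m1", "WO-10241")
    for attempt_crash in (crash_point, None):  # first delivery, then reconnect
        for msg_id, row in broker.pending():
            if handle(row, attempt_crash):
                broker.ack(msg_id)
    return table


print(simulate("before_commit", idempotent=False))  # one row: the re-run did the work
print(simulate("after_commit", idempotent=False))   # two rows: the incident
print(simulate("after_commit", idempotent=True))    # one row: redelivery absorbed
```

Late acknowledgment alone handles the first window; only the idempotent write handles the second, which is why the two changes ship together.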

Resolution

Before: early acknowledgment, non-idempotent insert

The configuration that produces the incident is usually just the Celery defaults plus an unguarded insert. Nothing here is obviously wrong, which is exactly why it survives code review and fails in production.

# celery_config.py — DEFAULTS THAT CAUSE THE DUPLICATE
task_acks_late = False          # worker ACKs just before the task runs, not after commit
task_reject_on_worker_lost = False
worker_prefetch_multiplier = 4  # worker hoards messages it may never ACK

@app.task
def batch_work_orders(tickets: list[dict]):
    with engine.begin() as conn:
        # A plain INSERT: a redelivered batch re-runs this and collides on the PK.
        conn.execute(insert_stmt, tickets)

After: late acknowledgment + deterministic idempotency keys

Three changes turn the redelivery into a harmless no-op. Each is commented inline so the intent is unambiguous in review.

# celery_config.py — FAILOVER-SAFE CONFIGURATION
broker_transport_options = {
    "visibility_timeout": 3600,   # >= longest batch runtime; stops premature redelivery
    "retry_on_timeout": True,
}
task_acks_late = True             # defer ACK until the task RETURNS (after DB commit)
task_reject_on_worker_lost = True # requeue, don't ACK, if the worker dies mid-batch
worker_prefetch_multiplier = 1    # critical: fetch one batch at a time during failover

broker_connection_retry_on_startup = True
broker_connection_max_retries = 10
broker_pool_limit = 10

With task_acks_late = True, the ACK is sent only after batch_work_orders returns. If the worker crashes before commit, the unacknowledged message is redelivered and the batch runs again — correct, because nothing was written. If it crashes after commit but before the ACK lands (the incident above), the redelivery still happens, so the write must absorb it. That is the job of a deterministic idempotency key derived from stable ticket metadata, paired with a unique constraint and ON CONFLICT DO NOTHING.

import hashlib


def generate_idempotency_key(ticket: WorkOrderPayload) -> str:
    """Deterministic per-ticket key — identical input always yields the same key,
    so a redelivered batch produces the same keys and collides harmlessly."""
    raw = f"{ticket.work_order_id}|{ticket.asset_id}|{ticket.created_at.isoformat()}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
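One detail worth checking is that the key depends on the exact string form of created_at. The sketch below uses a hypothetical helper, key_from_fields, that applies the same recipe to already-serialized fields, which is how the dict-based idem_key in the reproducible pipeline sees them. It suggests that a redelivered payload hashes identically, while the same instant rendered in a different timestamp format does not.

```python
import hashlib
from datetime import datetime, timezone


def key_from_fields(work_order_id: str, asset_id: str, created_at_iso: str) -> str:
    """Same recipe as generate_idempotency_key, applied to serialized fields."""
    raw = f"{work_order_id}|{asset_id}|{created_at_iso}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


created = datetime(2024, 5, 12, 8, 14, 22, tzinfo=timezone.utc)

first_delivery = key_from_fields("WO-10241", "AHU-03", created.isoformat())
redelivery = key_from_fields("WO-10241", "AHU-03", created.isoformat())
# Same instant, but rendered with a trailing "Z" instead of "+00:00".
reformatted = key_from_fields(
    "WO-10241", "AHU-03", created.strftime("%Y-%m-%dT%H:%M:%SZ")
)

print(first_delivery == redelivery)   # identical serialized input, identical key
print(first_delivery == reformatted)  # different string form, different key
```

In practice this means the timestamp should be serialized once, at intake, and carried unchanged in the payload; re-rendering it anywhere between intake and insert would defeat the key.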

from sqlalchemy import text

# Requires: ALTER TABLE work_orders ADD CONSTRAINT uq_idem UNIQUE (idempotency_key);
INSERT_STMT = text("""
    INSERT INTO work_orders
        (idempotency_key, work_order_id, asset_id, priority,
         requested_completion, escalation_tier, status, created_at)
    VALUES
        (:key, :wo_id, :asset, :priority,
         :requested_completion, :escalation_tier, :status, :created)
    ON CONFLICT (idempotency_key) DO NOTHING   -- redelivery becomes a no-op
""")

The combination yields at-least-once delivery with exactly-once effect: the broker may hand the batch over any number of times, but the SLA-bearing row is written at most once. This is the same guarantee the parent stage relies on when it hands trade-grouped batches to the broker, and it mirrors the idempotency contract enforced by JSON Schema validation for work order payloads at the upstream boundary.
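The no-op behavior can be tried without a PostgreSQL instance. The sketch below uses SQLite from the Python standard library as a stand-in, on the assumption that the bundled SQLite is 3.24 or newer (the release that added the ON CONFLICT clause); the table is trimmed to three columns and the keys are placeholder strings rather than real hashes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE work_orders ("
    " idempotency_key TEXT NOT NULL UNIQUE,"
    " work_order_id TEXT NOT NULL,"
    " asset_id TEXT NOT NULL)"
)

INSERT_SQL = (
    "INSERT INTO work_orders (idempotency_key, work_order_id, asset_id) "
    "VALUES (:key, :wo_id, :asset) "
    "ON CONFLICT (idempotency_key) DO NOTHING"
)

batch = [
    {"key": "k-0001", "wo_id": "WO-10241", "asset": "AHU-03"},
    {"key": "k-0002", "wo_id": "WO-10242", "asset": "AHU-04"},
]


def deliver(rows):
    """Write one batch in a transaction and return the table's row count."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(INSERT_SQL, rows)
    return conn.execute("SELECT COUNT(*) FROM work_orders").fetchone()[0]


print(deliver(batch))  # row count after the first delivery
print(deliver(batch))  # row count after a redelivery of the same batch
```

Both calls report the same row count, and the second one raises no error, which is the behavior the ingestion task needs from the real table.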

Minimal Reproducible Pipeline

The script below is self-contained: it defines the Celery app, the idempotent task, and the canonical payload, then dispatches a batch. Run a broker (redis-server) and a worker (celery -A mre worker -l info), call enqueue(), and force a failover with redis-cli DEBUG SLEEP 5 mid-batch to confirm the second delivery does not duplicate rows.

# mre.py — end-to-end, idempotent Celery batcher
import hashlib
from datetime import datetime, timezone

from celery import Celery
from sqlalchemy import create_engine, text

app = Celery("cmms_pipeline", broker="redis://localhost:6379/0")
app.conf.update(
    task_acks_late=True,
    task_reject_on_worker_lost=True,
    worker_prefetch_multiplier=1,
    broker_transport_options={"visibility_timeout": 3600, "retry_on_timeout": True},
    broker_connection_retry_on_startup=True,
)

engine = create_engine("postgresql+psycopg2://cmms_user:pass@localhost:5432/cmms_prod")

INSERT_STMT = text("""
    INSERT INTO work_orders
        (idempotency_key, work_order_id, asset_id, priority,
         requested_completion, escalation_tier, status, created_at)
    VALUES (:key, :wo_id, :asset, :priority, :requested_completion,
            :escalation_tier, :status, :created)
    ON CONFLICT (idempotency_key) DO NOTHING
""")


def idem_key(t: dict) -> str:
    raw = f"{t['work_order_id']}|{t['asset_id']}|{t['created_at']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


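@app.task(
    bind=True,
    # Explicit name: without it the worker registers "mre.batch_work_orders"
    # while `python mre.py` would publish under the app's main name.
    name="cmms_pipeline.batch_work_orders",
    max_retries=3,
    default_retry_delay=60,
)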
def batch_work_orders(self, tickets: list[dict]):
    try:
        with engine.begin() as conn:  # commit happens here; ACK fires after return
            conn.execute(
                INSERT_STMT,
                [
                    {
                        "key": idem_key(t),
                        "wo_id": t["work_order_id"],
                        "asset": t["asset_id"],
                        "priority": t.get("priority", "standard"),
                        "requested_completion": t.get("requested_completion"),
                        "escalation_tier": t.get("escalation_tier", 0),
                        "status": t.get("status", "open"),
                        "created": t["created_at"],
                    }
                    for t in tickets
                ],
            )
    except Exception as exc:
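        # Retries on any exception. To retry only transient DB errors, catch
        # sqlalchemy.exc.OperationalError instead. Idempotency conflicts are
        # absorbed by ON CONFLICT and never reach here.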
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)


def enqueue() -> None:
    now = datetime.now(timezone.utc).isoformat()
    batch = [
        {
            "work_order_id": "WO-10241",
            "asset_id": "AHU-03",
            "priority": "high",
            "requested_completion": now,
            "escalation_tier": 1,
            "status": "open",
            "created_at": now,
        }
    ]
    batch_work_orders.delay(batch)


if __name__ == "__main__":
    enqueue()

To verify the fix held, force a redelivery and then run SELECT COUNT(*), COUNT(DISTINCT idempotency_key) FROM work_orders; the two counts must be equal. In the worker log, the redelivered task id should be received a second time and succeed without an IntegrityError. If your metrics exporter publishes a retry counter such as celery_task_retries_total, watch it across a broker failover window: a stable pipeline shows no retry spike and no IntegrityError entries.
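The count comparison can be wrapped in a small helper for a smoke test. This is a sketch under assumptions: excess_rows is a hypothetical name, and the demo runs against an in-memory SQLite table that deliberately has no unique constraint so the check has something to find. Against PostgreSQL the same query would go through a cursor or a SQLAlchemy connection.

```python
import sqlite3


def excess_rows(conn):
    """Rows beyond the number of distinct idempotency keys (0 means clean)."""
    total, distinct = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT idempotency_key) FROM work_orders"
    ).fetchone()
    return total - distinct


conn = sqlite3.connect(":memory:")
# No UNIQUE constraint here, so a duplicated key can be inserted for the demo.
conn.execute("CREATE TABLE work_orders (idempotency_key TEXT, work_order_id TEXT)")
conn.executemany(
    "INSERT INTO work_orders VALUES (?, ?)",
    [("k-0001", "WO-10241"), ("k-0001", "WO-10241"), ("k-0002", "WO-10242")],
)
print(excess_rows(conn))  # non-zero: one key was written twice
```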

Prevention Checklist

Set task_acks_late = True and task_reject_on_worker_lost = True so a worker crash requeues the batch instead of dropping it.
Pin worker_prefetch_multiplier = 1 for ingestion workers — over-fetching multiplies the duplicate blast radius during failover.
Add a UNIQUE constraint on idempotency_key and write with ON CONFLICT DO NOTHING; never rely on the primary key alone to catch redeliveries.
Derive idempotency keys from stable payload fields (work order id, asset id, created-at) — never from wall-clock time at insert.
Set visibility_timeout above your longest batch runtime so a slow batch is not redelivered while still in flight.
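The checklist can also be enforced mechanically, for example in a deployment test. The sketch below is one possible shape, with a hypothetical function name and plain dictionaries standing in for the Celery configuration; the setting names are the ones used on this page, and the constraint and key-derivation items are left to database migrations and code review.

```python
def failover_config_issues(conf: dict, longest_batch_seconds: float) -> list:
    """Compare a settings dict against the checklist; return human-readable issues."""
    issues = []
    if conf.get("task_acks_late") is not True:
        issues.append("task_acks_late should be True")
    if conf.get("task_reject_on_worker_lost") is not True:
        issues.append("task_reject_on_worker_lost should be True")
    if conf.get("worker_prefetch_multiplier") != 1:
        issues.append("worker_prefetch_multiplier should be 1 for ingestion workers")
    timeout = conf.get("broker_transport_options", {}).get("visibility_timeout")
    if timeout is None or timeout <= longest_batch_seconds:
        issues.append(
            "visibility_timeout should be set explicitly above the longest batch runtime"
        )
    return issues


incident_conf = {
    "task_acks_late": False,
    "task_reject_on_worker_lost": False,
    "worker_prefetch_multiplier": 4,
}
safe_conf = {
    "task_acks_late": True,
    "task_reject_on_worker_lost": True,
    "worker_prefetch_multiplier": 1,
    "broker_transport_options": {"visibility_timeout": 3600},
}

for issue in failover_config_issues(incident_conf, longest_batch_seconds=900):
    print(issue)
print(failover_config_issues(safe_conf, longest_batch_seconds=900))
```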

FAQ

Why not just deduplicate at the broker level? A broker can drop retransmissions of the same delivery, but it cannot tell that a redelivered message after a reconnect produced effects the first time around. Exactly-once effect has to be enforced where the side effect lives — the CMMS database — via the idempotency key and unique constraint.

Does late acknowledgment risk losing work orders? No. With task_acks_late = True, an unacknowledged message stays in the broker until the task returns. A crash before commit simply re-runs the batch, and because the insert is idempotent (ON CONFLICT DO NOTHING), a crash after commit is harmless on redelivery.

Will ON CONFLICT DO NOTHING hide real validation errors? It only suppresses collisions on idempotency_key. Schema and field errors should already be rejected upstream; pair this stage with JSON Schema validation so malformed payloads never reach the insert.

How does this interact with retries on transient DB locks? The self.retry() path covers transient failures like lock timeouts. Because the key is deterministic, a retried batch re-derives the same keys and converges on the same rows — retries and redeliveries are both safe.
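For a sense of the timing, the retry schedule in the reproducible pipeline can be worked out by hand. The sketch below assumes the task's own settings (max_retries=3 and countdown=2 ** self.request.retries); retry_countdowns is a hypothetical helper, not a Celery API. Because an explicit countdown is passed to self.retry(), the default_retry_delay=60 on the decorator does not apply to these retries.

```python
def retry_countdowns(max_retries: int) -> list:
    """Delays in seconds produced by countdown=2 ** self.request.retries.

    self.request.retries is 0 on the first run, so retry n waits 2 ** (n - 1).
    """
    return [2 ** attempt for attempt in range(max_retries)]


schedule = retry_countdowns(3)
print(schedule)       # delay before each of the three retries
print(sum(schedule))  # total seconds spent waiting before retries are exhausted
```

With three retries the batch runs at most four times in total, so a lock that persists for more than a few seconds will exhaust the retries; a longer base delay may suit slower lock contention.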

Return to the parent stage in async batch processing for the full batcher and envelope contract, compare the duplicate-ingestion race at the inbound edge in configuring IMAP polling for maintenance email queues, and see how structured intake feeds this stage from email intake configuration and NLP intent classification. For the schema rules that keep these payloads idempotent end to end, see JSON Schema validation for work order payloads.
