Async Batch Processing for CMMS Work Order Routing

Async batch processing is the routing-and-dispatch stage of the Work Order Ingestion & Parsing Pipelines domain — the component that takes already-normalized maintenance requests and consolidates them into prioritized, trade-specific batches a worker pool can dispatch without overwhelming downstream CMMS endpoints.

By the time a payload reaches this stage it has already been captured, decoded, and mapped to the canonical schema upstream; this component does not parse anything. It aggregates discrete events into bounded processing windows, scores them, and hands trade-grouped batches to a message broker so that corrective and preventive work orders reach the right crews inside their service-level windows. Treating dispatch as an asynchronous batch problem — rather than a synchronous call per ticket — is what keeps a burst of inbound requests from saturating the CMMS API, starving low-volume queues, or double-dispatching the same job after a worker crash. This guide implements that stage end to end: prerequisites, the input/output data contract, a step-by-step Python build, a configuration reference, validation checks, and the failure modes you will actually hit in production.

Prerequisites

This component runs as a long-lived service: an accumulator that forms batches and a pool of workers that consume them. Before you deploy it, confirm the following are in place.

Python 3.11+ with pydantic>=2.6 for the batch envelope contract, pika>=1.3 for the RabbitMQ transport (or redis>=5.0 if you publish to Redis Streams), and redis>=5.0 for the deduplication cache that enforces idempotency. No parsing libraries are needed here — payloads arrive already structured.
A message broker reachable over the network: RabbitMQ with a durable queue and a paired dead-letter exchange, or Redis Streams with a consumer group. The broker must persist messages so an in-flight batch survives a broker restart.
CMMS REST API v1 with write access to the dispatch resource (POST /api/v1/work-orders/dispatch) and read access to the trade roster (GET /api/v1/crews). The routing logic reads crew certifications and on-shift status; it never writes back to the asset registry.
Environment variables: CMMS_BASE_URL, CMMS_API_TOKEN, BROKER_URL, REDIS_URL, BATCH_MAX_SIZE (default 50), and BATCH_WINDOW_SECONDS (default 30). The token must carry the workorders:dispatch and crews:read scopes; a token missing workorders:dispatch fails closed at startup rather than silently dead-lettering every batch.
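
The environment variables above can be loaded and range-checked once at startup. The following is a minimal sketch, not the service's actual loader: the `RoutingSettings` dataclass and `load_settings` function are hypothetical names, and the accepted ranges are taken from the configuration reference later in this guide. It does not verify token scopes, which would need a call to the CMMS API.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variables with no default: startup should stop if any is absent.
REQUIRED = ("CMMS_BASE_URL", "CMMS_API_TOKEN", "BROKER_URL", "REDIS_URL")


@dataclass(frozen=True)
class RoutingSettings:
    """Hypothetical settings container for the routing stage."""
    cmms_base_url: str
    cmms_api_token: str
    broker_url: str
    redis_url: str
    batch_max_size: int = 50
    batch_window_seconds: float = 30.0


def load_settings(env: Mapping[str, str] = os.environ) -> RoutingSettings:
    """Read settings from a mapping and fail fast on missing or out-of-range values."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required environment variables: {', '.join(missing)}")

    max_size = int(env.get("BATCH_MAX_SIZE", "50"))
    window = float(env.get("BATCH_WINDOW_SECONDS", "30"))
    # Ranges mirror the configuration reference: 1-50 orders, 5-300 seconds.
    if not 1 <= max_size <= 50:
        raise ValueError(f"BATCH_MAX_SIZE must be between 1 and 50, got {max_size}")
    if not 5 <= window <= 300:
        raise ValueError(f"BATCH_WINDOW_SECONDS must be between 5 and 300, got {window}")

    return RoutingSettings(
        cmms_base_url=env["CMMS_BASE_URL"],
        cmms_api_token=env["CMMS_API_TOKEN"],
        broker_url=env["BROKER_URL"],
        redis_url=env["REDIS_URL"],
        batch_max_size=max_size,
        batch_window_seconds=window,
    )
```

Accepting a mapping rather than reading `os.environ` directly keeps the loader easy to exercise in tests with a plain dictionary.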

Architecture and Data Contract

The component sits between schema normalization and crew dispatch. It consumes a stream of validated, single work order payloads and emits batches of routing-resolved work — never the reverse. Four boundaries keep the stage honest and stop ingestion volatility from leaking into the dispatch layer:

Ingestion boundary: a payload enters the accumulator only after upstream validation; the batcher trusts the schema and never re-parses transport artifacts. Whether the request originated from email intake configuration or a portal submission, it crosses this boundary in the same canonical shape.
Batching boundary: the accumulator groups payloads into a WorkOrderRoutingEnvelope when either the size threshold or the time window is reached, whichever comes first. This dual threshold caps memory during a burst while still flushing low-volume periods promptly.
Routing boundary: a worker consumes one envelope, scores each order, partitions by trade group, and resolves a dispatch target. Same envelope, same roster snapshot, same dispatch decision, every time.
Dispatch boundary: the resolved batch is posted to the CMMS dispatch endpoint under at-least-once delivery, paired with an idempotency key so a redelivered envelope can never double-dispatch. The routing scope terminates the moment the CMMS acknowledges the batch.
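
The dispatch boundary can be illustrated with a request builder that attaches the envelope's correlation_id as an idempotency key. This is a sketch under stated assumptions: the endpoint path comes from the prerequisites, but the `Idempotency-Key` header name and the bearer-token scheme are assumptions about the CMMS API, and `build_dispatch_request` is a hypothetical helper. The function only builds the request; it does not send it.

```python
import json
import urllib.request
from typing import Any, Dict


def build_dispatch_request(
    envelope: Dict[str, Any], base_url: str, token: str
) -> urllib.request.Request:
    """Build (but do not send) the POST that hands one envelope to the CMMS."""
    # default=str lets datetimes and UUIDs serialize without a custom encoder.
    body = json.dumps(envelope, default=str).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/work-orders/dispatch",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the CMMS accepts a bearer token.
            "Authorization": f"Bearer {token}",
            # Assumption: hypothetical header name; check what your CMMS honors.
            "Idempotency-Key": str(envelope["correlation_id"]),
        },
    )
```

If the CMMS honors such a key, a redelivered envelope is rejected or collapsed on the server side as well, which backs up the consumer-side deduplication described in step 6.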

The contract across the batching boundary is explicit. The input is a stream of WorkOrderPayload objects; the output is a WorkOrderRoutingEnvelope carrying a bounded list of those payloads plus the routing metadata a worker needs. Encoding the envelope as a pydantic model means an oversized batch, an empty batch, or an unknown priority tier is rejected at the boundary instead of producing a malformed dispatch.

Step-by-Step Implementation

1. Define the canonical work order payload

Routing operates on the same WorkOrderPayload used everywhere on this site. In the service it is imported rather than redefined so the SLA fields stay identical across stages; the definition is reproduced here for reference. The SLA fields (priority, requested_completion, and escalation_tier) are exactly what the scoring step reads to decide how urgently a batched job must move. Field-level schema rules live in work order schema standards.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List, Optional


class Priority(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    STANDARD = "standard"
    PLANNED = "planned"


@dataclass
class WorkOrderPayload:
    """Canonical CMMS work order — SLA fields are mandatory site-wide."""
    work_order_id: str
    asset_id: str
    part_skus: List[str]
    required_quantities: Dict[str, int]
    priority: Priority = Priority.STANDARD
    requested_completion: Optional[datetime] = None
    escalation_tier: int = 0
    status: str = "open"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

2. Define the batch envelope contract

The envelope is the unit of transport across the broker. Modeling it with pydantic enforces the size bounds and the priority vocabulary at construction time, so a batch that violates the contract never reaches the queue. Each envelope carries a correlation_id that becomes the idempotency key for the whole batch.

import uuid
from datetime import datetime, timezone
from typing import List, Literal
from pydantic import BaseModel, Field


class WorkOrderRoutingEnvelope(BaseModel):
    """Output contract: one bounded, trade-targeted batch ready for dispatch."""
    correlation_id: uuid.UUID = Field(default_factory=uuid.uuid4)
    batch_id: str
    created_at: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))
    orders: List[dict] = Field(..., min_length=1, max_length=50)
    priority_tier: Literal["critical", "high", "standard", "planned"]
    target_trade_group: str
    sla_deadline: datetime

3. Accumulate payloads into deterministic batches

The accumulator flushes on whichever threshold is met first — a full batch or an elapsed window. The size cap bounds memory during a burst; the time window guarantees that a single low-priority ticket arriving at 2 a.m. still dispatches within a known latency instead of waiting for the batch to fill.

import time
from typing import Callable, List


class WorkOrderBatcher:
    def __init__(
        self,
        flush: Callable[[List[WorkOrderPayload]], None],
        max_size: int = 50,
        window_seconds: float = 30.0,
    ) -> None:
        self.flush = flush
        self.max_size = max_size
        self.window_seconds = window_seconds
        self._buffer: List[WorkOrderPayload] = []
        self._window_started_at = time.monotonic()

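    def add(self, order: WorkOrderPayload) -> None:
        """Buffer one order, flushing when the size threshold is reached."""
        if not self._buffer:
            # The window opens when the first order of a batch arrives, not at
            # the previous flush; otherwise an order that follows an idle period
            # would flush alone on the very next tick.
            self._window_started_at = time.monotonic()
        self._buffer.append(order)
        if len(self._buffer) >= self.max_size:
            self._flush_now()

    def tick(self) -> None:
        """Call on a scheduler tick; flushes a partial batch once the window elapses."""
        window_elapsed = bool(self._buffer) and (
            time.monotonic() - self._window_started_at >= self.window_seconds
        )
        if window_elapsed:
            self._flush_now()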

    def _flush_now(self) -> None:
        batch, self._buffer = self._buffer, []
        self._window_started_at = time.monotonic()
        if batch:
            self.flush(batch)

4. Score, partition by trade group, and build the envelope

A flushed batch is rarely homogeneous, so the builder scores each order, groups by required trade, and emits one envelope per trade group. Scoring is deterministic: the same orders and the same clock reference always yield the same ordering, which is what makes the stage reproducible. Corrective urgency and preventive timing both feed the score — the cadence that fires preventive jobs is set upstream by PM interval calculation, and the asset metadata that picks the trade group comes from asset hierarchy design.

from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List


PRIORITY_WEIGHT = {
    Priority.CRITICAL: 1000,
    Priority.HIGH: 100,
    Priority.STANDARD: 10,
    Priority.PLANNED: 1,
}


def score_order(order: WorkOrderPayload, now: datetime) -> float:
    """Higher score dispatches first; combines priority and SLA urgency."""
    base = PRIORITY_WEIGHT[order.priority]
    if order.requested_completion is not None:
        seconds_left = (order.requested_completion - now).total_seconds()
        # Tighter deadlines add urgency; overdue work goes strongly positive.
        base += max(0.0, 86_400.0 - seconds_left) / 3_600.0
    return base + order.escalation_tier


def trade_group_for(order: WorkOrderPayload) -> str:
    """Map an asset prefix to a certified trade queue."""
    prefix = order.asset_id.split("-", 1)[0].upper()
    return {
        "HVAC": "HVAC", "CHL": "HVAC",
        "ELE": "ELECTRICAL", "SWG": "ELECTRICAL",
        "PMP": "PLUMBING", "VLV": "PLUMBING",
    }.get(prefix, "GENERAL_MAINT")


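def build_envelopes(
    batch: List[WorkOrderPayload], now: datetime | None = None
) -> List[WorkOrderRoutingEnvelope]:
    """Partition a scored batch into one envelope per trade group.

    Pass an explicit ``now`` to pin the clock reference for reproducible runs.
    """
    now = now or datetime.now(timezone.utc)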
    grouped: Dict[str, List[WorkOrderPayload]] = defaultdict(list)
    for order in sorted(batch, key=lambda o: score_order(o, now), reverse=True):
        grouped[trade_group_for(order)].append(order)

    envelopes: List[WorkOrderRoutingEnvelope] = []
    for trade, orders in grouped.items():
        top = orders[0]
        envelopes.append(
            WorkOrderRoutingEnvelope(
                batch_id=f"{trade}-{int(now.timestamp())}",
                orders=[o.__dict__ for o in orders],
                priority_tier=top.priority.value,
                target_trade_group=trade,
                sla_deadline=top.requested_completion or now,
            )
        )
    return envelopes
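
To make the scoring arithmetic concrete, the sketch below restates the same formula with plain strings and numbers so it runs on its own. The `score` and `urgency_bonus` helpers are illustrative stand-ins for `score_order`, not part of the stage; the weights and the 24-hour urgency horizon are the ones used above.

```python
from typing import Optional

PRIORITY_WEIGHT = {"critical": 1000, "high": 100, "standard": 10, "planned": 1}


def urgency_bonus(seconds_left: float) -> float:
    """One point per hour inside the 24-hour horizon; grows further once overdue."""
    return max(0.0, 86_400.0 - seconds_left) / 3_600.0


def score(priority: str, seconds_left: Optional[float] = None, escalation_tier: int = 0) -> float:
    """Same formula as score_order, with the deadline expressed as seconds remaining."""
    base = float(PRIORITY_WEIGHT[priority])
    if seconds_left is not None:
        base += urgency_bonus(seconds_left)
    return base + escalation_tier


# Critical, due in 2 hours: 1000 + (86400 - 7200) / 3600 = 1022.0
print(score("critical", seconds_left=2 * 3600))
# High, no deadline: 100.0
print(score("high"))
# Standard, 1 hour overdue: 10 + (86400 + 3600) / 3600 = 35.0
print(score("standard", seconds_left=-3600))
# Planned, due in 48 hours: outside the horizon, so no bonus: 1.0
print(score("planned", seconds_left=48 * 3600))
```

With these weights the priority tier dominates: a standard order only matches a high-priority order that has no deadline once it is 66 hours overdue (10 + 90 = 100). If overdue routine work should surface sooner, the weights or the hourly divisor are the values to tune.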

5. Publish each envelope to the broker

Envelopes are serialized to JSON and published with persistent delivery so an in-flight batch survives a broker restart. The message_id is set to the correlation_id, which the consumer later uses as the deduplication key.

import pika


def publish_envelope(envelope: WorkOrderRoutingEnvelope, broker_url: str, queue: str) -> None:
    """Publish one durable, persistent envelope to the routing queue."""
    connection = pika.BlockingConnection(pika.URLParameters(broker_url))
    try:
        channel = connection.channel()
        channel.queue_declare(queue=queue, durable=True)
        channel.basic_publish(
            exchange="",
            routing_key=queue,
            body=envelope.model_dump_json(),
            properties=pika.BasicProperties(
                delivery_mode=2,  # Persistent — survives a broker restart.
                content_type="application/json",
                message_id=str(envelope.correlation_id),
            ),
        )
    finally:
        connection.close()

6. Consume idempotently with acknowledgment and dead-lettering

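Workers consume with manual acknowledgment so a crash mid-dispatch redelivers the envelope to the next consumer: at-least-once delivery. Idempotency narrows that guarantee toward an exactly-once effect: before dispatching, the worker claims the correlation_id in Redis with a TTL matching the SLA window. A second delivery finds the key already set, acknowledges, and skips. In this minimal consumer a failed dispatch releases the claim and is dead-lettered immediately for review rather than dropped; the retry budget (max_retries with backoff) belongs to the worker pool or a retry queue in front of the dead-letter queue. One caveat: a worker that dies after claiming the key but before the CMMS acknowledges never reaches the exception handler, so the claim stays in place and the redelivery is skipped until the TTL expires.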

import json
import logging
from typing import Callable

import redis

logger = logging.getLogger(__name__)


def make_consumer(redis_url: str, dedup_ttl: int, dispatch: Callable[[dict], None]):
    cache = redis.Redis.from_url(redis_url)

    def on_message(channel, method, properties, body) -> None:
        correlation_id = properties.message_id
        # SET NX returns None (falsy) if the key already exists -> duplicate delivery.
        claimed = cache.set(f"wo:dispatch:{correlation_id}", "1", nx=True, ex=dedup_ttl)
        if not claimed:
            logger.info("skip duplicate envelope %s", correlation_id)
            channel.basic_ack(delivery_tag=method.delivery_tag)
            return
        try:
            dispatch(json.loads(body))
            channel.basic_ack(delivery_tag=method.delivery_tag)
            logger.info("dispatched envelope %s", correlation_id)
        except Exception:
            # Release the claim so a genuine retry can re-run, then dead-letter.
            cache.delete(f"wo:dispatch:{correlation_id}")
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
            logger.exception("dead-lettered envelope %s", correlation_id)

    return on_message

For a production-grade worker pool with retry policy, result tracking, and autoscaling rather than a hand-rolled consumer, the foundation is laid out in implementing Celery for async work order batching.
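
One way to keep a hard worker crash from stranding an envelope behind its own claim is a two-state claim: a short-lived "in_progress" marker that is upgraded to "done" only after the CMMS acknowledges. The sketch below is an illustration of that idea, not the consumer shown above. `InMemoryClaims` is a hypothetical stand-in that mimics the `set(..., nx=True, ex=...)` and `get` calls of a Redis client so the logic can run without a server, and the `claim` and `mark_done` helpers are likewise hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryClaims:
    """Hypothetical stand-in for Redis SET NX/EX and GET; for illustration only."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # Expired keys behave as absent.
            return None
        return entry[0]

    def set(self, key: str, value: str, ex: float, nx: bool = False) -> Optional[bool]:
        if nx and self.get(key) is not None:
            return None  # Mirrors redis-py, which returns None when NX fails.
        self._data[key] = (value, self._clock() + ex)
        return True


def claim(store, key: str, in_progress_ttl: float) -> str:
    """Decide what to do with one delivery: 'dispatch', 'skip', or 'retry_later'."""
    if store.set(key, "in_progress", ex=in_progress_ttl, nx=True):
        return "dispatch"
    # With a real Redis client this comparison assumes decode_responses=True.
    if store.get(key) == "done":
        return "skip"
    # Another worker holds the claim (or it just expired): try again later.
    return "retry_later"


def mark_done(store, key: str, done_ttl: float) -> None:
    """Call only after the CMMS has acknowledged the batch."""
    store.set(key, "done", ex=done_ttl)
```

In a consumer, "skip" would map to an acknowledgment, and "retry_later" to a requeue or delayed retry rather than an acknowledgment. A dispatch that runs longer than the in-progress TTL can still be picked up twice, so a server-side idempotency key, where the CMMS offers one, remains the backstop.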

Configuration Reference

Keep every tunable in a version-controlled configuration registry, not in the worker source. The defaults below are conservative starting points for a multi-building facility.

| Parameter | Accepted values | Default | CMMS-specific notes |
| --- | --- | --- | --- |
| `batch_max_size` | `1`–`50` | `50` | Hard cap enforced by the envelope schema; larger batches reduce dispatch calls but raise per-batch blast radius on failure. |
| `batch_window_seconds` | `5`–`300` | `30` | Upper bound on dispatch latency for a partial batch; tighten for critical queues, loosen for planned PM. |
| `broker_url` | AMQP / Redis URL | none (required) | Must point at a durable queue with a paired dead-letter exchange. |
| `prefetch_count` | `1`–`100` | `10` | Per-worker unacknowledged limit; set to `1` for slow CMMS endpoints to avoid head-of-line blocking. |
| `dedup_ttl_seconds` | `300`–`86400` | `3600` | TTL of the idempotency key; should equal or exceed the longest SLA window so a late redelivery is still caught. |
| `max_retries` | `0`–`10` | `5` | Redeliveries before an envelope is dead-lettered; pair with backoff to avoid hammering the CMMS API. |
| `backoff_base_seconds` | `0.5`–`30` | `2` | Base for exponential backoff with jitter applied between retries. |
| `dlq_name` | any queue name | `routing.dlq` | Dead-letter target; review its depth daily for systemic tag or schema problems. |

Validation and Testing

Routing must be reproducible, so the highest-value test asserts that the same batch and clock reference always partition and order identically. A single deterministic assertion catches accidental nondeterminism — an unstable sort key, a clock-dependent default — before it reaches production.

from datetime import datetime, timedelta, timezone


def test_batching_is_deterministic_and_trade_partitioned():
    now = datetime.now(timezone.utc)
    orders = [
        WorkOrderPayload("WO-1001", "HVAC-12", ["FILT-1"], {"FILT-1": 1},
                         priority=Priority.CRITICAL,
                         requested_completion=now + timedelta(hours=2)),
        WorkOrderPayload("WO-1002", "PMP-204", ["SEAL-9"], {"SEAL-9": 2},
                         priority=Priority.STANDARD),
        WorkOrderPayload("WO-1003", "HVAC-15", ["BELT-3"], {"BELT-3": 1},
                         priority=Priority.HIGH),
    ]
    a = build_envelopes(orders)
    b = build_envelopes(orders)
    by_trade = {e.target_trade_group: e for e in a}

    assert [e.target_trade_group for e in a] == [e.target_trade_group for e in b]
    assert set(by_trade) == {"HVAC", "PLUMBING"}
    # Within HVAC, the critical order outranks the high-priority one.
    assert by_trade["HVAC"].orders[0]["work_order_id"] == "WO-1001"
    assert by_trade["HVAC"].priority_tier == "critical"

On a successful run each worker emits a single structured log line per envelope — dispatched envelope 7f3a... — which is the canonical signal that a batch reached the CMMS. A redelivery of an already-processed batch logs skip duplicate envelope ... and acknowledges without re-dispatching; seeing that line is healthy, not an error. Assert against both log lines in integration tests to verify the full accumulate-to-dispatch path, and confirm the dead-letter queue stays empty for a clean batch.

Failure Modes and Troubleshooting

Each scenario below pairs a symptom with the checks that isolate its root cause. Work through the checks in order.

The same work order is dispatched twice

Confirm the publisher sets message_id to the envelope correlation_id; without it the consumer has no stable deduplication key and every redelivery re-dispatches.
Check that dedup_ttl_seconds is at least as long as the SLA window — a key that expires before a slow redelivery arrives lets the duplicate through.
Verify the worker uses SET NX to claim the key before dispatching, not after; claiming after the CMMS call leaves a race window on crash.

A partial batch never dispatches

Confirm a scheduler actually calls tick() on a fixed interval; the size threshold alone never flushes a buffer that stays below batch_max_size.
Check batch_window_seconds is not set absurdly high for a low-volume queue, which makes a single ticket appear stuck.
Verify the flush callback is wired to build_envelopes + publish_envelope and is not swallowing exceptions silently.
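
The first and third checks can be covered by a small background ticker that calls tick() on a fixed interval and logs any exception instead of dying silently. This is a minimal sketch: `start_ticker` is a hypothetical helper, it accepts any zero-argument callable, and it assumes the caller serializes add() and tick() with a lock if they run on different threads.

```python
import logging
import threading
from typing import Callable

logger = logging.getLogger(__name__)


def start_ticker(tick: Callable[[], None], interval_seconds: float) -> Callable[[], None]:
    """Call `tick` every `interval_seconds` on a daemon thread; return a stop function."""
    stop = threading.Event()

    def run() -> None:
        # Event.wait returns False on timeout and True once stop is set,
        # so the loop ticks on each timeout and exits promptly on stop.
        while not stop.wait(interval_seconds):
            try:
                tick()
            except Exception:
                # Log and keep ticking; a dead ticker means partial batches stall.
                logger.exception("batch tick failed")

    thread = threading.Thread(target=run, name="batch-ticker", daemon=True)
    thread.start()

    def stop_ticker() -> None:
        stop.set()
        thread.join()

    return stop_ticker
```

Typical wiring would be `stop = start_ticker(batcher.tick, 1.0)` at service startup and `stop()` during shutdown. A tick interval well below batch_window_seconds keeps the added flush latency small.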

The dead-letter queue is filling up

Inspect a dead-lettered body — a recurring pydantic ValidationError on orders length or priority_tier points at a malformed upstream payload, not a transport fault.
Look for crews:read or workorders:dispatch scope errors; an under-scoped token fails every dispatch and dead-letters the whole stream.
If the CMMS API is returning 429, the batch size or prefetch_count is too aggressive — lower both and confirm exponential backoff with jitter is applied between retries.

Workers stampede the CMMS API after an outage

Confirm backoff uses jitter, not a fixed delay; synchronized retries from every worker recreate the thundering-herd surge that caused the outage.
Reduce prefetch_count so a single worker cannot drain the queue and fire dozens of concurrent dispatch calls.
Cache the crew roster and asset metadata locally and refresh on a schedule so recovery does not also hammer the read endpoints — the parts side of that lookup is covered by parts availability checks.
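
Exponential backoff with jitter, as called for in the first check, can be sketched in a few lines. This example uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The `backoff_delay` helper and its `cap_seconds` parameter are hypothetical; only the base of 2 seconds comes from the configuration reference.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base_seconds: float = 2.0,
    cap_seconds: float = 300.0,  # Hypothetical ceiling; not a documented setting.
    rng: Optional[random.Random] = None,
) -> float:
    """Return a randomized delay for the given zero-based retry attempt."""
    source = rng or random
    # The ceiling doubles with each attempt until it reaches the cap.
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    # Drawing from [0, ceiling] spreads workers out instead of retrying in lockstep.
    return source.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(5):
        print(f"attempt {attempt}: sleep up to {min(300.0, 2.0 * 2 ** attempt):.0f}s, "
              f"drew {backoff_delay(attempt):.2f}s")
```

Because every worker draws its own delay, retries after an outage arrive spread across the window rather than as a single synchronized burst.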

Frequently Asked Questions

Why batch dispatches instead of routing each work order as it arrives?

A per-ticket synchronous call ties throughput to the slowest downstream response and turns an intake burst into a denial-of-service against your own CMMS API. Batching amortizes connection overhead, lets you score and trade-partition a window of work together, and gives the broker a durable unit to redeliver on failure. The dual-threshold accumulator keeps latency bounded so batching never strands a single urgent ticket.

How does at-least-once delivery avoid double-dispatching a job?

The broker guarantees an envelope is delivered at least once, which on its own would re-dispatch after a worker crash. The Redis idempotency claim closes that gap: the first worker to SET NX the correlation_id owns the batch, and any redelivery finds the key set and acknowledges without re-running. The combination yields at-least-once delivery with exactly-once effect.

Should I use RabbitMQ or Redis Streams for the routing queue?

Either works behind the same envelope contract. RabbitMQ gives you mature dead-letter exchanges and per-message acknowledgment out of the box, which suits strict SLA routing. Redis Streams is attractive when you already run Redis for the deduplication cache and want one fewer system to operate. Keep the publish and consume code behind a thin interface so the broker choice stays swappable.
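
The thin interface mentioned above could look like the following sketch. The `EnvelopePublisher` protocol, the `InMemoryPublisher` test double, and `publish_all` are all hypothetical names; a RabbitMQ or Redis Streams adapter would implement the same single method.

```python
from typing import Iterable, List, Protocol, Tuple


class EnvelopePublisher(Protocol):
    """Anything that can durably publish one serialized envelope."""

    def publish(self, body: bytes, message_id: str) -> None:
        ...


class InMemoryPublisher:
    """Test double that records what would have been sent to a broker."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, bytes]] = []

    def publish(self, body: bytes, message_id: str) -> None:
        self.sent.append((message_id, body))


def publish_all(publisher: EnvelopePublisher, messages: Iterable[Tuple[str, bytes]]) -> int:
    """Publish (message_id, body) pairs through whichever broker adapter is supplied."""
    count = 0
    for message_id, body in messages:
        publisher.publish(body=body, message_id=message_id)
        count += 1
    return count
```

Routing code that depends only on the protocol can be tested against the in-memory double and pointed at either broker in production without changes.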

How large should a batch be?

Start at the schema cap of 50 and lower it if the CMMS dispatch endpoint rate-limits or if a single failed batch dead-letters too much work at once. Smaller batches shrink blast radius and smooth API load; larger batches cut the number of dispatch calls. Tune batch_max_size and batch_window_seconds together against observed dispatch latency and 429 rates rather than guessing.

Feed this stage from email intake configuration, enrich ambiguous requests with NLP intent classification, extract attached documents through PDF parsing with Python, and scale the worker pool with implementing Celery for async work order batching.
