PDF Parsing with Python for CMMS Work Order Ingestion

PDF parsing is the document-decoding stage of the Work Order Ingestion & Parsing Pipelines domain — the component that converts vendor service reports, OEM equipment manuals, and emailed maintenance requests into the same canonical work order payload every later stage already understands.

Facilities operations generate a continuous stream of PDFs, and manual transcription of those documents creates dispatch bottlenecks, introduces data-entry errors, and delays corrective action. This component closes that gap. It does not perform routing, scoring, or dispatch; it takes a single PDF file and emits one validated, schema-conformant payload (or a clearly-flagged failure) so that downstream routing can run without human intervention. The objective is not generic optical character recognition or text dumping — it is deterministic field isolation that produces a structured envelope with measurable extraction confidence. This guide implements that stage end to end: prerequisites, the input/output data contract, a step-by-step Python build, a configuration reference, validation checks, and the failure modes you will actually hit against real vendor layouts.

Prerequisites

This component runs as a stateless worker: it consumes a file reference from the intake queue, parses it, and hands a payload to the next stage. Before you deploy it, confirm the following are in place.

Python 3.11+ with pdfplumber>=0.11 for coordinate-aware text and table extraction, and pydantic>=2.6 if you want to validate the emitted payload at the stage boundary. No OCR engine is required for digitally-generated PDFs; scanned image-only documents are out of scope for this component and should be branched to an OCR pre-stage.
A file source, not a polling loop. Files arrive as references on a message broker after upstream capture — typically from email intake configuration, which strips the attachment and enqueues a path or object key. The parser never scans a filesystem directory directly.
CMMS REST API v1 read access to the asset registry (GET /api/v1/assets/{asset_id}) so extracted asset identifiers can be cross-referenced before a work order is created. The parser holds read-only scope; it never writes to the registry.
Environment variables: CMMS_BASE_URL, CMMS_API_TOKEN (scope assets:read), PARSE_TIMEOUT_SECONDS (default 8), MIN_ROUTING_CONFIDENCE (default 0.6), and DEAD_LETTER_PATH for documents that fail extraction. A token missing assets:read fails closed at startup rather than silently routing unverified asset IDs.
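
The environment contract above can be read once at startup and rejected early if anything is missing or out of range. The following is a minimal sketch, not the component's actual loader: ParserSettings and load_settings are hypothetical names, treating DEAD_LETTER_PATH as required is an assumption, and the 1 to 30 second and 0.0 to 1.0 bounds are taken from the configuration reference in this guide. Checking that the token really carries assets:read needs a call to the CMMS and is not shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ParserSettings:
    """Hypothetical container for the parser's environment contract."""
    cmms_base_url: str
    cmms_api_token: str
    parse_timeout_seconds: float
    min_routing_confidence: float
    dead_letter_path: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> ParserSettings:
    """Read settings from the environment and fail closed on bad values."""
    env = os.environ if env is None else env

    # Required values have no safe default, so a missing one aborts startup.
    # Assumption: DEAD_LETTER_PATH is treated as required here.
    required = ("CMMS_BASE_URL", "CMMS_API_TOKEN", "DEAD_LETTER_PATH")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")

    timeout = float(env.get("PARSE_TIMEOUT_SECONDS", "8"))
    if not 1 <= timeout <= 30:
        raise ValueError("PARSE_TIMEOUT_SECONDS must be between 1 and 30")

    threshold = float(env.get("MIN_ROUTING_CONFIDENCE", "0.6"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("MIN_ROUTING_CONFIDENCE must be between 0.0 and 1.0")

    return ParserSettings(
        cmms_base_url=env["CMMS_BASE_URL"].rstrip("/"),
        cmms_api_token=env["CMMS_API_TOKEN"],
        parse_timeout_seconds=timeout,
        min_routing_confidence=threshold,
        dead_letter_path=env["DEAD_LETTER_PATH"],
    )
```

Passing a plain dict instead of os.environ keeps the loader easy to exercise in tests without touching the real process environment.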

Architecture and Data Contract

The component sits at the strict boundary between document receipt and routing logic. It consumes raw PDF bytes and emits a single canonical work order payload wrapped in an extraction result that carries confidence and provenance — never the reverse, and never a partially-parsed half-record that leaks into the dispatch layer. Three boundaries keep the stage honest.

Ingestion boundary: a document enters the parser only as a file reference plus its source channel. The parser trusts nothing about layout and treats every vendor template as untrusted input subject to a hard timeout.
Extraction boundary: raw page text is reduced to a WorkOrderPayload using bounded, non-greedy patterns that enforce field edges, so adjacent data blocks cannot cross-contaminate. Each matched field contributes to an aggregate extraction_confidence.
Routing boundary: the wrapped result is validated against the CMMS contract. A payload missing critical keys, or scoring below MIN_ROUTING_CONFIDENCE, is routed to manual triage or to semantic enrichment instead of being dispatched. The parsing scope terminates the moment the result crosses this boundary.

The contract across the extraction boundary is explicit. The input is the byte content of one PDF plus a source_channel label; the output is an ExtractionResult carrying a populated WorkOrderPayload and the metadata a router needs to decide what to do with it. Modeling the output as a typed object means a missing asset identifier, an unknown priority string, or an empty extraction is surfaced as state at the boundary rather than as a malformed dispatch downstream. Field-level schema rules for the payload itself live in work order schema standards.

Step-by-Step Implementation

1. Define the canonical work order payload

The parser populates the same WorkOrderPayload used everywhere on this site. In production it is imported rather than redefined so the SLA fields stay identical across every stage; the definition is reproduced here for reference. The SLA fields (priority, requested_completion, and escalation_tier) are exactly what later scoring reads to decide how urgently a job must move, so the parser must populate them from the document or fall back to safe defaults. In this guide only priority is read from the document; the other two keep their defaults.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, List, Optional


class Priority(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    STANDARD = "standard"
    PLANNED = "planned"


@dataclass
class WorkOrderPayload:
    """Canonical CMMS work order — SLA fields are mandatory site-wide."""
    work_order_id: str
    asset_id: str
    part_skus: List[str]
    required_quantities: Dict[str, int]
    priority: Priority = Priority.STANDARD
    requested_completion: Optional[datetime] = None
    escalation_tier: int = 0
    status: str = "open"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

The parsing stage wraps that payload in an ExtractionResult so confidence and provenance travel with the record instead of being logged and lost. The router reads the wrapper; every later stage reads the inner payload.

@dataclass
class ExtractionResult:
    """Parser output: the canonical payload plus extraction provenance."""
    payload: WorkOrderPayload
    fault_description: Optional[str] = None
    location_code: Optional[str] = None
    requested_by: Optional[str] = None
    source_channel: str = "unknown"
    extraction_confidence: float = 0.0
    raw_metadata: Dict[str, Any] = field(default_factory=dict)

2. Extract text with pdfplumber under a hard timeout

pdfplumber is the recommended library for coordinate-based extraction because it preserves the spatial relationships that regex-only approaches lose on multi-column vendor forms. Run the open-and-read under a guard bounded by PARSE_TIMEOUT_SECONDS so a malformed or multi-hundred-page OEM manual cannot stall the worker; the function below performs only the read, so the guard has to be applied around it. Concatenate page text with explicit newlines so field patterns can anchor on line boundaries.

import logging
import pdfplumber

logger = logging.getLogger("pdf_parser")


def extract_text(pdf_path: str) -> str:
    """Return concatenated page text, or '' if the document yields nothing."""
    pages_text: List[str] = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            pages_text.append(page.extract_text() or "")
    full_text = "\n".join(pages_text)
    if not full_text.strip():
        logger.warning("Empty text extraction for %s (likely scanned image)", pdf_path)
    return full_text
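
The extract_text function above reads every page without checking the clock. One way to bound the work is a cooperative deadline checked after each page. This is a sketch under stated assumptions: ParseTimeout and join_pages_with_deadline are hypothetical names, the function accepts any iterable of page strings, and because the check runs only between pages it cannot interrupt a single page that hangs. A truly hard limit needs process-level isolation, which is not shown.

```python
import time
from typing import Callable, Iterable, List, Optional


class ParseTimeout(Exception):
    """Raised when a document exceeds its parsing budget (hypothetical name)."""


def join_pages_with_deadline(
    page_texts: Iterable[Optional[str]],
    timeout_seconds: float = 8.0,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Join page text with newlines, aborting once the deadline has passed."""
    deadline = clock() + timeout_seconds
    collected: List[str] = []
    for text in page_texts:
        collected.append(text or "")
        # Checked after each page, so total time is bounded only loosely:
        # one slow page can still overrun the budget before this line runs.
        if clock() > deadline:
            raise ParseTimeout(f"deadline exceeded after {len(collected)} pages")
    return "\n".join(collected)
```

With pdfplumber, the iterable could be a generator such as (page.extract_text() for page in pdf.pages) created inside the with pdfplumber.open(...) block, so each page is extracted lazily and the clock is consulted between pages.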

3. Isolate fields with bounded, non-greedy patterns

Greedy patterns are the most common cause of cross-field contamination, where a fault description swallows the requestor line below it. Every pattern below anchors on a labelled prefix. The identifier patterns constrain the captured character class and length; the free-text patterns match lazily and stop at a blank line (description), a line end (requestor), or end of input. Each successful match contributes a confidence weight so the aggregate reflects how much of the record was recovered deterministically.

import re

ASSET_RE = re.compile(r"(?:Asset|Equipment)\s*ID[:\s]+([A-Z0-9\-]{4,12})", re.IGNORECASE)
# "Loc" alone cannot match the full word "Location:", so accept both spellings.
LOC_RE = re.compile(r"(?:Loc(?:ation)?|Bldg|Zone|Area|Room)[:\s]+([A-Z0-9\-]{3,10})", re.IGNORECASE)
PRIORITY_RE = re.compile(r"(?:Priority|Urgency|Severity)[:\s]+([A-Za-z0-9\-]+)", re.IGNORECASE)
DESC_RE = re.compile(
    r"(?:Description|Fault|Issue|Symptom|Findings)[:\s]+(.+?)(?:\n{2,}|$)",
    re.IGNORECASE | re.DOTALL,
)
REQ_RE = re.compile(
    r"(?:Requestor|Submitted by|Contact|Initiated by)[:\s]+(.+?)(?:\n|$)",
    re.IGNORECASE,
)


# Number of fields the patterns try to recover; confidence is measured against all of them.
EXPECTED_FIELDS = 5


def isolate_fields(full_text: str) -> Dict[str, Any]:
    """Return raw field values plus an aggregate confidence in [0, 1]."""
    fields: Dict[str, Any] = {}
    weights: List[float] = []

    if (m := ASSET_RE.search(full_text)):
        fields["asset_id"] = m.group(1).strip()
        weights.append(1.0)
    if (m := LOC_RE.search(full_text)):
        fields["location_code"] = m.group(1).strip()
        weights.append(1.0)
    if (m := PRIORITY_RE.search(full_text)):
        fields["priority_raw"] = m.group(1).strip()
        weights.append(0.9)
    if (m := DESC_RE.search(full_text)):
        fields["fault_description"] = m.group(1).strip()
        weights.append(0.85)
    if (m := REQ_RE.search(full_text)):
        fields["requested_by"] = m.group(1).strip()
        weights.append(0.9)

    # Divide by the number of expected fields, not the number matched, so a
    # document that recovers a single field cannot score 1.0.
    fields["confidence"] = sum(weights) / EXPECTED_FIELDS
    return fields

4. Normalize vendor priority strings to the SLA vocabulary

Vendors write CRITICAL, Emergency, URGENT, and a dozen other strings; the SLA fields demand the canonical Priority vocabulary. Map deterministically and treat an unrecognized string as STANDARD rather than guessing, while lowering confidence so the router can decide whether to enrich it.

PRIORITY_MAP = {
    "CRITICAL": Priority.CRITICAL, "EMERGENCY": Priority.CRITICAL,
    "URGENT": Priority.HIGH, "HIGH": Priority.HIGH,
    "MEDIUM": Priority.STANDARD, "NORMAL": Priority.STANDARD,
    "LOW": Priority.PLANNED, "PLANNED": Priority.PLANNED,
}


def normalize_priority(raw: Optional[str]) -> tuple[Priority, bool]:
    """Return (priority, recognized). Unknown strings degrade to STANDARD."""
    if not raw:
        return Priority.STANDARD, False
    mapped = PRIORITY_MAP.get(raw.upper().strip())
    if mapped is None:
        return Priority.STANDARD, False
    return mapped, True

5. Assemble the ExtractionResult

Combine the recovered fields into the canonical payload. The parser does not yet know the part requirements — those are resolved later against the registry — so it emits empty part_skus and lets parts availability checks populate them downstream. A document with no recoverable asset identifier still produces a result, flagged by its confidence, so nothing is silently dropped.

import uuid


def build_result(pdf_path: str, source_channel: str = "email") -> ExtractionResult:
    full_text = extract_text(pdf_path)
    if not full_text.strip():
        empty = WorkOrderPayload(
            work_order_id=f"WO-{uuid.uuid4().hex[:10]}",
            asset_id="",
            part_skus=[],
            required_quantities={},
        )
        return ExtractionResult(payload=empty, source_channel=source_channel,
                                extraction_confidence=0.0,
                                raw_metadata={"reason": "empty_extraction"})

    fields = isolate_fields(full_text)
    priority, recognized = normalize_priority(fields.get("priority_raw"))
    confidence = fields["confidence"] * (1.0 if recognized else 0.7)

    payload = WorkOrderPayload(
        work_order_id=f"WO-{uuid.uuid4().hex[:10]}",
        asset_id=fields.get("asset_id", ""),
        part_skus=[],
        required_quantities={},
        priority=priority,
    )
    return ExtractionResult(
        payload=payload,
        fault_description=fields.get("fault_description"),
        location_code=fields.get("location_code"),
        requested_by=fields.get("requested_by"),
        source_channel=source_channel,
        extraction_confidence=round(confidence, 3),
        raw_metadata={"priority_recognized": recognized},
    )

6. Validate the result and decide its route

The wrapped result must clear the CMMS contract before it can be dispatched. A missing asset identifier or location, or a confidence below MIN_ROUTING_CONFIDENCE, routes the record to review instead of dispatch. Free-form narratives that defeated the patterns are handed to NLP intent classification, which assigns work_type and trade codes without blocking the pipeline. Clean records flow on to async batch processing for scoring and dispatch.

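import os

# Read the threshold from the environment; 0.6 is the documented default.
MIN_ROUTING_CONFIDENCE = float(os.environ.get("MIN_ROUTING_CONFIDENCE", "0.6"))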


class RoutingStatus(str, Enum):
    READY = "ready"
    PENDING_REVIEW = "pending_review"
    FAILED = "failed"


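def route(result: ExtractionResult) -> RoutingStatus:
    """Decide whether a result can be dispatched or must be held for review."""
    p = result.payload
    # Asset and location are the critical keys; nothing dispatches without them.
    if not p.asset_id or not result.location_code:
        logger.warning("Missing critical keys for %s; routing to triage", p.work_order_id)
        return RoutingStatus.PENDING_REVIEW
    if result.extraction_confidence < MIN_ROUTING_CONFIDENCE:
        logger.info("Low confidence %.2f for %s; flagging for NLP enrichment",
                    result.extraction_confidence, p.work_order_id)
        return RoutingStatus.PENDING_REVIEW
    # Log the READY decision so integration tests can assert against it.
    logger.info("%s READY conf=%.2f channel=%s asset=%s",
                p.work_order_id, result.extraction_confidence,
                result.source_channel, p.asset_id)
    return RoutingStatus.READY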
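
RoutingStatus defines FAILED, and the configuration reference sends FAILED documents to DEAD_LETTER_PATH, but the route function shown here never returns that status. The sketch below is one possible way to record such documents, assuming that an empty extraction is what counts as a failure and that a JSON sidecar per document is an acceptable dead-letter record. The dead_letter helper and its record layout are hypothetical, not part of the original component.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def dead_letter(pdf_path: str, reason: str, dead_letter_dir: str) -> Path:
    """Write a small JSON record describing a document that failed extraction."""
    target_dir = Path(dead_letter_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    record = {
        "source_file": pdf_path,
        "reason": reason,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
    # One sidecar per document, named after the original file.
    record_path = target_dir / f"{Path(pdf_path).name}.failed.json"
    record_path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return record_path
```

A caller could invoke it when result.raw_metadata.get("reason") == "empty_extraction" and report RoutingStatus.FAILED for that document instead of sending it to review.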

Tabular data — parts consumption lists, calibration certificates, multi-row maintenance logs — needs more than coordinate text extraction. That problem has its own walkthrough in extracting tables from PDF work orders using pdfplumber, which covers header alignment, row merging, and column validation tuned to maintenance documentation.

Configuration Reference

Parameter	Accepted values	Default	CMMS-specific notes
`PARSE_TIMEOUT_SECONDS`	`1`–`30`	`8`	Hard cap per document; multi-hundred-page OEM manuals must not stall the worker pool.
`MIN_ROUTING_CONFIDENCE`	`0.0`–`1.0`	`0.6`	Below this, records go to review or NLP enrichment instead of dispatch.
`source_channel`	`email`, `portal`, `scanner`, `api`	`email`	Stamped onto the result so provenance survives into the dispatch record.
`ASSET_RE` charset	regex	`[A-Z0-9\-]{4,12}`	Widen the length bound only after auditing real asset tags against the asset hierarchy design.
`DEAD_LETTER_PATH`	filesystem / object path	—	Destination for `FAILED` documents pending manual review.
`vertical_strategy`	`lines`, `text`	`lines`	Switch to `text` for templates that draw columns with whitespace rather than vector rules.
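
The vertical_strategy row refers to pdfplumber's table-finding settings, which are passed as a table_settings dict to page.extract_table or page.extract_tables. A hedged sketch of how the switch might be wired: table_settings_for is a hypothetical helper, and pairing horizontal_strategy with the same value is an assumption made for illustration rather than something this guide prescribes.

```python
from typing import Dict

# The two strategies listed in the configuration table above.
ALLOWED_STRATEGIES = ("lines", "text")


def table_settings_for(vertical_strategy: str = "lines") -> Dict[str, str]:
    """Build a table_settings dict for pdfplumber's table extraction."""
    if vertical_strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unsupported vertical_strategy: {vertical_strategy!r}")
    return {
        "vertical_strategy": vertical_strategy,
        # Assumption: templates that space their columns also space their rows.
        "horizontal_strategy": vertical_strategy,
    }
```

The result would be used as page.extract_table(table_settings=table_settings_for("text")) for a template that separates columns with whitespace.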

Validation and Testing

Validate the stage with a fixed sample document whose fields you know, asserting both the extracted values and the routing decision. Because extraction is a pure transformation over file bytes, the same input always yields the same extracted fields, confidence, and route; only work_order_id and created_at are generated per run, so leave them out of the assertions. That keeps these tests stable in CI.

def test_known_work_order():
    result = build_result("fixtures/wo_sample_clean.pdf", source_channel="email")
    assert result.payload.asset_id == "PUMP-04812"
    assert result.payload.priority == Priority.CRITICAL
    assert result.location_code == "BLDG-3A"
    assert result.extraction_confidence >= 0.6
    assert route(result) == RoutingStatus.READY


def test_unmappable_priority_degrades():
    result = build_result("fixtures/wo_odd_priority.pdf")
    # An unrecognized priority string falls back to STANDARD and lowers confidence.
    assert result.payload.priority == Priority.STANDARD
    assert result.raw_metadata["priority_recognized"] is False

A READY route should also produce a log line you can assert against in an integration test. With a formatter that prints timestamp, level, and logger name, it looks like this:

[2026-06-28 09:02:11] INFO pdf_parser - WO-9f3c1a22e0 READY conf=0.94 channel=email asset=PUMP-04812

Failure Modes and Troubleshooting

Empty extraction on every document from one vendor

Confirm the PDF is digitally generated, not a scanned image — extract_text() returns "" for image-only pages, which means the document belongs in an OCR pre-stage, not this parser.
Open the file in pdfplumber and inspect page.chars; an empty list confirms there is no text layer to recover.
Branch image-only sources to OCR before this stage rather than lowering MIN_ROUTING_CONFIDENCE, which would only push blank records downstream.
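
The page.chars check can be automated so image-only files are diverted before any field pattern runs. A minimal sketch: has_text_layer is a hypothetical helper, it accepts pdf.pages from an open pdfplumber document (or any objects exposing a chars attribute), and sampling only the first three pages is an assumption that trades certainty for speed.

```python
from typing import Any, Iterable


def has_text_layer(pages: Iterable[Any], sample_pages: int = 3) -> bool:
    """Return True if any of the first few pages exposes character objects."""
    for index, page in enumerate(pages):
        if index >= sample_pages:
            break
        # An empty chars list means the page carries no extractable text.
        if page.chars:
            return True
    return False
```

A document for which this returns False is a candidate for the OCR pre-stage rather than for field isolation.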

Fault description swallows the requestor line

Check that DESC_RE still terminates on \n{2,}; a vendor template that uses single newlines between blocks defeats the blank-line boundary, and because DOTALL lets . match newlines, the lazy .+? runs on to the end of the text.
Add an explicit stop alternation for the next known label so the description cannot cross into the requestor field.
Re-run the known-document assertion in CI so a template change that reintroduces the bleed fails the build.
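
The stop alternation can be expressed as a lookahead so the next label ends the description without being consumed. This is a sketch rather than a drop-in replacement: DESC_BOUNDED_RE and NEXT_LABELS are hypothetical names, the label list is assembled from the prefixes the other patterns in this guide already use, and the sample text is invented for illustration.

```python
import re

# Labels that start another field; a description must stop before any of them.
NEXT_LABELS = (
    r"(?:Requestor|Submitted by|Contact|Initiated by|Priority|Urgency|Severity)"
)

DESC_BOUNDED_RE = re.compile(
    r"(?:Description|Fault|Issue|Symptom|Findings)[:\s]+"
    r"(.+?)"
    # Stop at a blank line, at the next known label on a new line, or at end of input.
    r"(?=\n{2,}|\n\s*" + NEXT_LABELS + r"\s*:|\Z)",
    re.IGNORECASE | re.DOTALL,
)

# Single newline between blocks: the case that defeats the blank-line boundary.
sample = "Description: Pump seal leaking at flange\nSubmitted by: J. Ortiz\n"
match = DESC_BOUNDED_RE.search(sample)
description = match.group(1).strip() if match else None
```

Here description holds only the fault text, and the requestor line is left for REQ_RE to match. Multi-line descriptions still work because the lookahead fires only on a blank line or a recognized label.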

Asset IDs extracted but rejected by the registry

Verify the captured tag matches the registry format before widening ASSET_RE; a too-loose charset captures page numbers or revision codes that look like asset IDs.
Cross-reference against the CMMS asset registry (GET /api/v1/assets/{asset_id}) so a work order is never raised against an unknown or mistyped asset, and confirm the token actually carries assets:read.
Align the captured pattern with the canonical tag scheme defined in the asset hierarchy design.
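
The registry cross-reference can be sketched with the standard library. Only the GET /api/v1/assets/{asset_id} path comes from this guide; the Bearer authorization scheme, the meaning assigned to 200 and 404, and the asset_exists helper itself are assumptions to check against the actual CMMS API. The status lookup is injectable so the decision logic can be tested without a network call.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Optional


def _http_status(url: str, token: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET, treating HTTP errors as statuses."""
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},  # auth scheme is an assumption
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def asset_exists(
    asset_id: str,
    base_url: str,
    token: str,
    get_status: Callable[[str, str], int] = _http_status,
) -> Optional[bool]:
    """True if registered, False if unknown, None if the lookup is inconclusive."""
    quoted = urllib.parse.quote(asset_id, safe="")
    url = f"{base_url.rstrip('/')}/api/v1/assets/{quoted}"
    status = get_status(url, token)
    if status == 200:
        return True
    if status == 404:
        return False
    # Anything else (for example 401 or 403) may mean the token lacks scope,
    # so the caller should hold the record rather than guess.
    return None
```

Network failures raise urllib.error.URLError from the default lookup and are left to the caller, which keeps a registry outage distinct from an unknown asset.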

Worker memory climbs while processing large manuals

Confirm PARSE_TIMEOUT_SECONDS is enforced and that pdfplumber.open runs inside a with block so page objects are released.
Process and discard pages in a generator rather than holding every page object in a list when documents routinely exceed a few hundred pages.
Run the parser as a stateless worker so a leaked document tears down with the process instead of accumulating across the pool.
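
The generator approach can be sketched as follows. iter_page_text is a hypothetical helper that accepts pdf.pages from an open pdfplumber document. Recent pdfplumber releases document a Page.close() method for flushing a page's cached objects; treat that call as an assumption to verify against the installed version, which is why the sketch looks the method up defensively.

```python
from typing import Any, Iterable, Iterator


def iter_page_text(pages: Iterable[Any]) -> Iterator[str]:
    """Yield each page's text, then release that page's cached objects."""
    for page in pages:
        try:
            yield page.extract_text() or ""
        finally:
            # Assumption: the installed pdfplumber exposes Page.close();
            # skip the call quietly if it does not.
            close = getattr(page, "close", None)
            if callable(close):
                close()
```

Joining the generator with "\n".join(iter_page_text(pdf.pages)) produces the same string as collecting pages in a list, while each page's cache is released as soon as its text has been consumed.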

Frequently Asked Questions

Why use bounded regular expressions instead of an LLM for field extraction?

For labelled, template-driven vendor forms, bounded patterns are deterministic, auditable, and fast — the same document always yields the same payload, which is what automated dispatch requires. A language model is the right tool only for the free-form narratives that defeat the patterns, and the pipeline already routes those to NLP intent classification rather than running every document through a model. Deterministic extraction first, semantic enrichment second.

Why does the parser leave part_skus empty instead of extracting them from the PDF?

Parts on a vendor report are frequently informal descriptions, not catalog SKUs, and resolving them is a registry lookup, not a parsing problem. The parser emits empty part_skus and lets parts availability checks reconcile descriptions against the catalog downstream, which keeps the parsing stage stateless and avoids baking stale SKU mappings into extraction code.

How should scanned, image-only PDFs be handled?

This component targets digitally-generated PDFs that carry a text layer. A scanned document returns empty text and should be branched to an OCR pre-stage that produces a text layer before it reaches this parser. Lowering the confidence threshold to accept blank records only pushes the problem downstream.

What confidence threshold should trigger manual review?

Start at 0.6 and tune against observed triage volume. A clean five-field form with a recognized priority scores 0.93 under the weights above; a document that recovers only the asset ID and an unrecognized priority string lands well under the threshold and is correctly held for enrichment or review. Lower the bound only after confirming the records it would admit are genuinely routable.
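
The arithmetic behind those two cases can be checked directly. The sketch below assumes the aggregate is the sum of matched field weights divided by the five expected fields, multiplied by 0.7 when the priority string is not recognized. The score helper is hypothetical and only mirrors the weights used in isolate_fields and build_result.

```python
from typing import Iterable

# Per-field weights, copied from the isolate_fields step of this guide.
FIELD_WEIGHTS = {
    "asset_id": 1.0,
    "location_code": 1.0,
    "priority_raw": 0.9,
    "fault_description": 0.85,
    "requested_by": 0.9,
}


def score(matched: Iterable[str], priority_recognized: bool) -> float:
    """Aggregate confidence measured against all expected fields."""
    base = sum(FIELD_WEIGHTS[name] for name in matched) / len(FIELD_WEIGHTS)
    # An unrecognized priority string scales the whole score down.
    return round(base * (1.0 if priority_recognized else 0.7), 3)


clean_form = score(FIELD_WEIGHTS, priority_recognized=True)             # 0.93
sparse_form = score(["asset_id", "priority_raw"], priority_recognized=False)  # 0.266
```

The clean form clears the 0.6 default comfortably, while the sparse document falls well below it.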

Receive documents from email intake configuration, enrich free-form narratives with NLP intent classification, hand clean payloads to async batch processing, and dig into grid layouts with extracting tables from PDF work orders using pdfplumber.
