Field Mapping Strategies

Field mapping is the seam where probabilistic extraction output is reconciled with a deterministic database schema. It sits inside Parsing & Extraction Workflows, directly downstream of clause extraction and directly upstream of canonical storage: an NLP layer emits a loosely-keyed dictionary full of synonyms, partial values, and confidence scores, and the mapping layer must resolve those into a single typed record that a rent roll, an accounting ledger, or a renewal calendar can trust. The specific challenge this page solves is reconciliation under ambiguity — the same business concept arrives under a dozen different keys across jurisdictions and templates, values arrive as free-text strings that must become dates and decimals, and any field can be missing or wrong without warning.

Get this layer wrong and the damage is silent: a mis-mapped base_rent posts the wrong number to the books, a dropped commencement_date breaks every downstream renewal alert, and nobody notices until reconciliation season. This page covers the engineering of doing it safely — a priority-ordered mapping engine, a pydantic v2 validation boundary, confidence-gated fallback, idempotent commits, and the failure modes that quietly corrupt portfolio data at scale.

The Reconciliation Problem

A commercial lease portfolio rarely speaks one language. Leases originate from different brokers, jurisdictions, and decades, and each labels identical concepts differently. Commencement Date, Lease Start, and Effective Date are the same field; CAM, Operating Expenses, and Common Area Maintenance describe the same tenant obligation. Upstream regex and NLP clause extraction does not resolve this — it surfaces whatever the document said, tagged with a confidence score. Mapping is where heterogeneity collapses into one canonical shape.

That target shape is owned by the lease data models layer. The mapping engine’s job is not to invent a schema but to land extraction output on the existing one: flatten nested output, coerce strings into typed primitives, apply metadata normalization standards so dates and currencies are unambiguous, and decide what to do when a field cannot be resolved with enough confidence to commit. Everything below is in service of those four decisions, executed deterministically and auditably.

Prerequisites & Environment Setup

This page assumes a Python 3.11+ environment and the following pinned dependencies. The pydantic v2 field_validator / model_validator syntax is the project convention and is used throughout.

# requirements.txt
pydantic==2.7.1          # typed validation boundary + coercion
python-dateutil==2.9.0   # tolerant date parsing across template formats
babel==2.15.0            # locale-aware currency/number parsing

Assumptions baked into the examples below:

Input shape — extraction emits dict[str, Any] where each value may carry a parallel *_confidence float in [0.0, 1.0]. If your extractor returns a flat value, wrap it so confidence travels with the value.
Idempotency key — every record carries a stable lease_id (or upstream document_id) so commits can be replayed without creating duplicates.
No silent defaults — a missing field becomes None and is logged; it never becomes 0, "", or “today”. Defaulting a rent or a date to a placeholder is how bad data reaches the ledger undetected.

Decouple mapping from any single destination. The same canonical record should serialize to Yardi, MRI, AppFolio, or a custom ERP through per-system adapters — the mapping engine produces one validated payload and never embeds vendor-specific field names.

Mapping Decision Table

Before any code runs, the resolution order for each canonical field should be explicit and version-controlled. A declarative table makes mapping decisions auditable and lets a non-engineer review portfolio terminology without reading Python.

Canonical field	Source keys (priority order)	Type	Coercion	If unresolved
`lease_id`	`lease_id`, `document_id`, `ref_number`	`str`	trim	hard fail (required)
`tenant_name`	`tenant_name`, `lessee`, `occupant`	`str`	trim, collapse whitespace	flag `requires_review`
`lease_effective_date`	`commencement_date`, `start_date`, `effective_date`	`date`	parse → ISO 8601	`None` + review
`base_rent_monthly`	`base_rent`, `monthly_rent`, `rent_amount`	`Decimal`	strip currency, locale-aware	`None` + review
`cam_responsibility`	`cam_type`, `common_area_maintenance`, `operating_expenses`	`str`	normalize vocabulary	`None`
`escalation_type`	`escalation_method`, `rent_increase_type`, `cpi_adjustment`	`str`	normalize vocabulary	`None`

The escalation_type value is a handoff: once normalized here, it tells the escalation formula mapping engine which arithmetic to apply (fixed-step, percentage, or CPI-indexed). Mapping does not compute the escalation — it only resolves which kind of escalation the lease specifies, so the downstream engine can select the right formula.

Primary Implementation

The engine below resolves each canonical field by walking its priority-ordered source keys, coerces values at a single typed boundary, and propagates confidence so the routing layer can make an informed commit/review/reject decision. Currency is parsed into Decimal rather than float — binary floats silently misrepresent money, and a fraction of a cent compounded across a portfolio is a real reconciliation defect.

import logging
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Any, Optional

from dateutil import parser as date_parser
from pydantic import BaseModel, ConfigDict, ValidationError, field_validator

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(message)s",
)
logger = logging.getLogger("field_mapper")

# Canonical vocabularies — extraction synonyms collapse onto these tokens.
CAM_VOCAB = {
    "pro-rata share": "pro_rata",
    "pro rata": "pro_rata",
    "operating expenses": "operating_expense",
    "net": "triple_net",
    "nnn": "triple_net",
}
ESCALATION_VOCAB = {
    "cpi": "cpi_indexed",
    "cpi adjustment": "cpi_indexed",
    "fixed": "fixed_step",
    "stepped": "fixed_step",
    "percentage": "percentage",
}


class LeaseAbstractionTarget(BaseModel):
    # strict=True stops pydantic silently coercing 1 -> "1"; we coerce explicitly.
    model_config = ConfigDict(strict=True, extra="forbid")

    lease_id: str
    tenant_name: str
    lease_effective_date: Optional[date] = None
    base_rent_monthly: Optional[Decimal] = None
    cam_responsibility: Optional[str] = None
    escalation_type: Optional[str] = None

    @field_validator("base_rent_monthly", mode="before")
    @classmethod
    def coerce_currency(cls, v: Any) -> Optional[Decimal]:
        if v is None or isinstance(v, Decimal):
            return v
        if isinstance(v, (int, float)):
            return Decimal(str(v))  # str() avoids binary-float drift
        if isinstance(v, str):
            cleaned = (
                v.replace("$", "").replace(",", "").replace("USD", "").strip()
            )
            # Parenthesised negatives are an accounting convention: (1,200) = -1200
            if cleaned.startswith("(") and cleaned.endswith(")"):
                cleaned = "-" + cleaned[1:-1]
            try:
                return Decimal(cleaned)
            except InvalidOperation:
                logger.warning("Unparseable currency: %r", v)
                return None
        return None

    @field_validator("lease_effective_date", mode="before")
    @classmethod
    def parse_date(cls, v: Any) -> Optional[date]:
        if v is None or isinstance(v, date):
            return v
        if isinstance(v, str):
            try:
                # dayfirst=False: US leases dominate; override per-portfolio.
                return date_parser.parse(v.strip(), dayfirst=False).date()
            except (ValueError, OverflowError):
                logger.warning("Unparseable date: %r", v)
                return None
        return None

    @field_validator("cam_responsibility", "escalation_type", mode="before")
    @classmethod
    def normalize_vocab(cls, v: Any) -> Optional[str]:
        if not isinstance(v, str):
            return v
        token = v.strip().lower()
        return {**CAM_VOCAB, **ESCALATION_VOCAB}.get(token, token)


@dataclass
class MappingResult:
    record: Optional[LeaseAbstractionTarget]
    status: str                       # "commit" | "review" | "rejected"
    missing_fields: list[str] = field(default_factory=list)
    low_confidence: list[str] = field(default_factory=list)
    error: Optional[str] = None


class FieldMapper:
    def __init__(
        self,
        mapping_rules: dict[str, list[str]],
        review_threshold: float = 0.80,
    ):
        # mapping_rules: canonical field -> source keys, highest priority first.
        self.mapping_rules = mapping_rules
        self.review_threshold = review_threshold

    def _resolve(self, raw: dict[str, Any], keys: list[str]) -> tuple[Any, float]:
        """Return the first non-null source value and its confidence."""
        for key in keys:
            if raw.get(key) is not None:
                conf = float(raw.get(f"{key}_confidence", 1.0))
                return raw[key], conf
        return None, 0.0

    def map_and_validate(self, raw: dict[str, Any]) -> MappingResult:
        payload: dict[str, Any] = {}
        missing: list[str] = []
        low_conf: list[str] = []

        for canonical, source_keys in self.mapping_rules.items():
            value, confidence = self._resolve(raw, source_keys)
            payload[canonical] = value
            if value is None:
                missing.append(canonical)
            elif confidence < self.review_threshold:
                low_conf.append(canonical)

        try:
            record = LeaseAbstractionTarget(**payload)
        except ValidationError as exc:
            # Required-field or type failure: never reaches the property DB.
            logger.error("Schema validation failed: %s", exc)
            return MappingResult(None, "rejected", missing, low_conf, str(exc))

        status = "review" if (missing or low_conf) else "commit"
        logger.info("Mapped %s -> %s", record.lease_id, status)
        return MappingResult(record, status, missing, low_conf)


if __name__ == "__main__":
    RULES = {
        "lease_id": ["lease_id", "document_id", "ref_number"],
        "tenant_name": ["tenant_name", "lessee", "occupant"],
        "lease_effective_date": ["commencement_date", "start_date", "effective_date"],
        "base_rent_monthly": ["base_rent", "monthly_rent", "rent_amount"],
        "cam_responsibility": ["cam_type", "common_area_maintenance", "operating_expenses"],
        "escalation_type": ["escalation_method", "rent_increase_type", "cpi_adjustment"],
    }
    mapper = FieldMapper(RULES)
    raw_data = {
        "document_id": "LSE-2024-8891",
        "lessee": "Apex Logistics LLC",
        "start_date": "01/15/2024",
        "base_rent": "$14,500.00",
        "common_area_maintenance": "Pro-rata share",
        "common_area_maintenance_confidence": 0.62,
    }
    result = mapper.map_and_validate(raw_data)
    print(result.status, result.low_confidence)
    if result.record:
        print(result.record.model_dump_json(indent=2))

The engine returns a MappingResult rather than raising or returning a bare record. That distinction matters: the caller needs to know not just what was mapped but whether it can be trusted, so the routing layer can act on status without re-deriving it.

Validation & Quality Gates

Three gates stand between raw extraction and the property database, and a record must clear all three to auto-commit.

Schema validation. The pydantic boundary enforces required fields and types. extra="forbid" rejects unexpected keys so a renamed extraction field surfaces as a loud error instead of being silently dropped. A ValidationError produces status="rejected" and the record is dead-lettered with its original payload for offline triage — exactly the failure-classification contract used by the error handling and retry logic layer.

Confidence gating. Each resolved value carries the confidence of the source key it came from. Any field below the review threshold (0.80 in the example) marks the whole record requires_review. Because a wrong base rent posts real money, hold this gate higher than you would for a pure classification step. Records that validate structurally but fall short on confidence are not rejected — they are diverted, which is the job of fallback routing logic: low-confidence and partially-missing records go to a human review queue instead of either auto-committing or being thrown away.

Idempotent commit. Only status="commit" records reach the database, and they upsert on lease_id so a replayed batch is a no-op rather than a duplicate insert.

def commit_record(conn, result: MappingResult) -> None:
    """Upsert keyed on lease_id so batch replays never duplicate rows."""
    if result.status != "commit" or result.record is None:
        # review -> human queue; rejected -> dead-letter. Handled by the caller.
        return
    r = result.record
    conn.execute(
        """
        INSERT INTO lease_abstractions
            (lease_id, tenant_name, lease_effective_date,
             base_rent_monthly, cam_responsibility, escalation_type)
        VALUES (%(lease_id)s, %(tenant_name)s, %(lease_effective_date)s,
                %(base_rent_monthly)s, %(cam_responsibility)s, %(escalation_type)s)
        ON CONFLICT (lease_id) DO UPDATE SET
            tenant_name          = EXCLUDED.tenant_name,
            lease_effective_date = EXCLUDED.lease_effective_date,
            base_rent_monthly    = EXCLUDED.base_rent_monthly,
            cam_responsibility   = EXCLUDED.cam_responsibility,
            escalation_type      = EXCLUDED.escalation_type
        """,
        r.model_dump(),
    )

Troubleshooting

Concrete failure scenarios that show up against real lease portfolios, with the diagnostic signal and the fix.

Currency parses to the wrong magnitude. A value like 1.200,00 (European grouping) becomes 1.2 under naive comma-stripping. Diagnostic: committed rents that are off by exactly 1000x. Fix: route locale-formatted numbers through babel.numbers.parse_decimal with the lease’s jurisdiction locale rather than blanket-replacing commas.

Dates land in the wrong century or month. 01/02/2024 is January 2 in the US and February 1 in the UK; dateutil will guess. Diagnostic: renewal alerts firing a month early or late. Fix: set dayfirst per portfolio locale and reject two-digit years outright instead of letting the parser infer 19xx vs 20xx.

A renamed extraction key silently drops a field. The extractor starts emitting rent_per_month but the rule lists monthly_rent. Diagnostic: a sudden spike in missing_fields for one field across a whole batch. Fix: extra="forbid" plus a schema-drift monitor that diffs incoming key sets against the rule registry; add the new synonym to the rule, not to the code.

Zero-width and non-breaking spaces defeat exact-match keys. A key arrives as "base rent" and never matches "base rent". Diagnostic: a field that is visibly present in the source but reports as missing. Fix: normalize keys with unicodedata.normalize("NFKC", key) and strip control characters before resolution.

Amendment overrides the base lease’s value. A later amendment changes the rent, but the batch maps base and amendment independently. Diagnostic: the committed rent does not match the operative figure. Fix: do not resolve precedence in the mapping layer — map each document as its own versioned record and let the lease data models layer resolve the effective value by date. Mapping never overwrites in place.

Confidence is present but ignored on synonym hits. A low-priority synonym key resolves but its confidence travels under a differently-named *_confidence key, so the gate reads the default 1.0. Diagnostic: obviously shaky values committing without review. Fix: standardize the confidence key naming contract at the extraction boundary, and assert it in a unit test against synthetic payloads.

Performance & Scale Notes

Mapping is CPU-light relative to OCR or NLP, but it runs on every document, so its per-record cost compounds across a portfolio.

Compile the rule registry once. Build the merged vocabulary dict and per-field key lists at construction time, not per record. Instantiating a fresh FieldMapper inside a loop re-creates them for every lease.
Map in the worker, validate at the boundary. Field mapping is pure-Python and pickles cleanly, so it parallelizes well across the worker pool described in async batch processing. Keep pydantic validation inside the worker so a malformed record fails on its own task rather than poisoning the batch.
Batch the commits. Upserts dominate wall-clock at scale. Group status="commit" records and flush with executemany/COPY on the idempotency key rather than one round-trip per lease.
Treat the rule registry as data. When new terminology appears, update the declarative rules and redeploy configuration — not code. This keeps mapping changes reviewable, testable against synthetic payloads, and decoupled from release cycles.

For a concrete, header-drift-heavy application of these patterns at portfolio onboarding scale, see automating field mapping for rent roll data ingestion, which extends this engine to merged cells, locale-specific formatting, and spreadsheet header normalization.

Frequently Asked Questions

What confidence threshold should trigger manual review? Start at 0.80 for the aggregate score and tune against a labeled holdout. Because a mis-mapped rent or date posts real money or breaks a renewal calendar, hold financial and temporal fields higher than descriptive ones, and divert any record that fails schema validation regardless of its score.

How do I handle lease amendments that override base clauses? Map the amendment as its own versioned record with its own identifier and never overwrite the base lease in place. Precedence is resolved downstream by effective date in the lease data models, which preserves the audit trail mapping would otherwise destroy.

Why use Decimal instead of float for rent? Binary floats cannot represent most decimal cents exactly, so 0.1 + 0.2 is not 0.3. Across a portfolio those fractions compound into reconciliation defects. Parse currency strings straight into Decimal via str() and never round-trip through float.

Should mapping ever invent a default value? No. A missing field becomes None and is logged for review; it never becomes 0, an empty string, or today’s date. Silent defaults are how unverified data reaches the ledger undetected.

Automating Field Mapping for Rent Roll Data Ingestion — extending this engine to header drift, merged cells, and locale-specific spreadsheet formatting.
Regex & NLP Clause Extraction — the upstream stage that emits the confidence-scored, synonym-keyed payloads this layer reconciles.
Fallback Routing Logic — where low-confidence and partially-missing records divert for human review instead of auto-committing.
Metadata Normalization Standards — the date, currency, and vocabulary conventions the coercion layer enforces.
Lease Data Models — the canonical target schema mapped records land on, and where amendment precedence is resolved.

← Back to Parsing & Extraction Workflows

Continue in this section