Standardizing Lease Metadata Normalization Across Property Types

The precise decision this page resolves: when one portfolio mixes retail, office, industrial, and multifamily leases, do you force every asset into a single flat schema, maintain a separate schema per property type, or normalize into a shared canonical base with a typed extension per type? Pick wrong and the failure is not loud — a retail percentage-rent lease and a multifamily gross-rent lease both carry a base_rent field, so a flat schema accepts both and then silently misapplies amortization, triggers false reconciliation breaks, and corrupts portfolio yield. The job is to choose a structure that preserves type-specific contractual intent while still presenting one stable contract to every downstream consumer.

Architectural context

This technique sits inside metadata normalization standards, the stage within Core Architecture & Lease Taxonomy that turns heterogeneous extractor output into one trusted canonical payload before it reaches storage. It runs after a clause classification system has tagged spans and before records land in the lease data models layer. The cross-property-type problem is the hardest case of normalization specifically because identical field names encode different business logic per asset class — so the structural choice here determines whether the escalation formula engine downstream receives a number it can interpret, or one it merely guesses at.

The semantic-drift problem in mixed portfolios

A flat normalization approach assumes identical field names map to identical business logic. In commercial real estate that assumption is simply false. Base rent is the clearest example: in multifamily it is a flat monthly figure per unit; in retail it is a minimum annual guarantee subject to percentage-rent overrides tied to tenant sales; in industrial it is quoted per square foot annually with explicit expense pass-throughs; in office it may layer gross-up clauses and a CAM reconciliation schedule on top. When an extractor emits {"base_rent": 45000} with no type context, every downstream system has to guess which of those four meanings applies. Normalization’s job is to remove the guess by binding each value to its property type at the moment it enters the canonical model.
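
One way to picture that binding step is a small lookup that refuses to emit a rent figure without a property type. This is a minimal sketch, not part of the pipeline itself: `RENT_SEMANTICS`, `BoundRent`, and `bind_base_rent` are hypothetical names, and the frequency and basis labels only restate the four meanings described above.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical lookup restating the four meanings of "base rent" described above.
RENT_SEMANTICS = {
    "multifamily": ("monthly", "flat figure per unit"),
    "retail": ("annual", "minimum guarantee, subject to percentage-rent overrides"),
    "industrial": ("annual", "per square foot, with explicit expense pass-throughs"),
    "office": ("annual", "base figure, may carry gross-up and CAM reconciliation"),
}


@dataclass(frozen=True)
class BoundRent:
    amount: Decimal
    property_type: str
    frequency: str
    basis: str


def bind_base_rent(raw: dict) -> BoundRent:
    """Attach type context to a bare base_rent value, or refuse to guess."""
    property_type = str(raw.get("property_type") or "").strip().lower()
    if property_type not in RENT_SEMANTICS:
        raise ValueError(
            f"cannot interpret base_rent without a known property type: {property_type!r}"
        )
    frequency, basis = RENT_SEMANTICS[property_type]
    # Go through str() so a float input does not leak binary rounding into Decimal.
    return BoundRent(Decimal(str(raw["base_rent"])), property_type, frequency, basis)


bound = bind_base_rent({"base_rent": 45000, "property_type": "Retail"})
print(bound.frequency, "|", bound.basis)
```

The same payload without a `property_type` raises instead of returning a number, which is the behavior the rest of this page builds on.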

Comparing the three structural strategies

There is no universally correct schema shape — the trade-off is between query simplicity, type fidelity, and how gracefully the structure absorbs a newly acquired asset class. The table below is the decision matrix for choosing one.

Strategy	How it stores type variance	Best for	Risk if misapplied	Schema evolution cost
Flat lowest-common-denominator	Collapses every type into one shared field set	Single-asset-class portfolios that will never diversify	Silent semantic drift — same field name, four meanings	Low to add fields, catastrophic to reinterpret one
Separate schema per type	A fully independent model per property type	Teams with isolated per-type pipelines and no cross-asset reporting	Cross-portfolio queries need N-way unions; shared fields drift apart	High — every common field changes in N places
Canonical base + typed extensions	Shared base for common fields, a validated extension model per type	Mixed portfolios needing both portfolio-level rollups and type fidelity	Extension namespace bloats if discipline slips	Low — add a new extension model, base untouched

The recommended structure for any portfolio that mixes asset classes is the third: a canonical base carrying the fields every lease shares (lease_id, normalized rent, confidence, provenance) plus a discriminated extensions namespace that holds only the attributes a given property type actually has. Portfolio analytics query the base; type-specific accounting reads the extension. Adding a new asset class means adding one extension model, never touching the base or the existing types.
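
To make the split concrete, here is a stdlib-only sketch of the two read paths, using plain dicts in place of the stored payload. The record shape and the helpers `annual_rent`, `portfolio_annual_rent`, and `retail_breakpoints` are hypothetical illustrations; only the field names follow the base and extension attributes described above, and the sketch assumes rent has already been normalized to a known frequency.

```python
from decimal import Decimal

# Hypothetical canonical records: shared base fields plus one typed extension each.
records = [
    {
        "lease_id": "R-100",
        "property_type": "retail",
        "rent": {"base_amount": Decimal("120000"), "frequency": "annual"},
        "extension": {"kind": "retail", "breakpoint_sales_threshold": Decimal("2000000")},
    },
    {
        "lease_id": "M-200",
        "property_type": "multifamily",
        "rent": {"base_amount": Decimal("1500"), "frequency": "monthly"},
        "extension": {"kind": "multifamily", "unit_count": 1},
    },
]


def annual_rent(record: dict) -> Decimal:
    """Portfolio path: touches only base fields, never the extension."""
    rent = record["rent"]
    factor = 12 if rent["frequency"] == "monthly" else 1
    return rent["base_amount"] * factor


def portfolio_annual_rent(rows: list[dict]) -> Decimal:
    return sum((annual_rent(r) for r in rows), Decimal("0"))


def retail_breakpoints(rows: list[dict]) -> dict[str, Decimal]:
    """Type-specific path: reads the extension, and only for retail records."""
    return {
        r["lease_id"]: r["extension"]["breakpoint_sales_threshold"]
        for r in rows
        if r["extension"]["kind"] == "retail"
    }


print(portfolio_annual_rent(records))  # 138000
print(retail_breakpoints(records))
```

The rollup never branches on property type, and the retail report never needs to know what a multifamily extension contains.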

Recommended implementation

The engine below is a dispatcher built on pydantic v2, the validation convention used throughout this architecture. It keeps no state between records apart from an in-memory review queue. For each record it parses the declared property type, projects common fields into the canonical base, validates the matching typed extension, and routes anything low-confidence or malformed to a structured queue rather than coercing a default. Monetary fields use Decimal, never float, because cent-level rounding compounds across thousands of leases.

import logging
from decimal import Decimal
from enum import Enum
from typing import Literal, Optional, Union

from pydantic import BaseModel, Field, field_validator, model_validator

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("lease_type_normalizer")


class PropertyType(str, Enum):
    RETAIL = "retail"
    OFFICE = "office"
    INDUSTRIAL = "industrial"
    MULTIFAMILY = "multifamily"


class RentStructure(BaseModel):
    base_amount: Decimal
    frequency: Literal["monthly", "annual"]
    currency: str = "USD"


# One typed extension per property type — each holds ONLY the attributes
# that asset class actually carries. The base never sees these fields.
class RetailExtension(BaseModel):
    kind: Literal["retail"] = "retail"
    percentage_rent_rate: Optional[Decimal] = None
    breakpoint_sales_threshold: Optional[Decimal] = None


class OfficeExtension(BaseModel):
    kind: Literal["office"] = "office"
    cam_reconciliation_method: Literal["actual", "pro_rata", "capped"]
    nnn_expense_categories: list[str] = Field(default_factory=list)


class IndustrialExtension(BaseModel):
    kind: Literal["industrial"] = "industrial"
    triple_net_pass_throughs: list[str] = Field(default_factory=list)
    loading_dock_allocation_sqft: Optional[Decimal] = None


class MultifamilyExtension(BaseModel):
    kind: Literal["multifamily"] = "multifamily"
    unit_count: int = Field(gt=0)
    gross_potential_rent: Decimal
    vacancy_allowance_pct: Decimal = Field(ge=0, le=1)


# Discriminated union: pydantic selects the right model by the `kind` tag,
# so an office payload can never validate as a retail extension.
Extension = Union[
    RetailExtension, OfficeExtension, IndustrialExtension, MultifamilyExtension
]

EXTENSION_BY_TYPE = {
    PropertyType.RETAIL: RetailExtension,
    PropertyType.OFFICE: OfficeExtension,
    PropertyType.INDUSTRIAL: IndustrialExtension,
    PropertyType.MULTIFAMILY: MultifamilyExtension,
}


class CanonicalLeaseMetadata(BaseModel):
    lease_id: str
    property_type: PropertyType
    rent: RentStructure
    extension: Extension = Field(discriminator="kind")
    extraction_confidence: float = Field(ge=0.0, le=1.0)
    provenance: str = "lease_abstraction_pipeline_v1"

    @field_validator("rent")
    @classmethod
    def _multifamily_is_monthly(cls, v: RentStructure, info) -> RentStructure:
        if info.data.get("property_type") is PropertyType.MULTIFAMILY and v.frequency != "monthly":
            raise ValueError("Multifamily base rent must normalize to monthly frequency.")
        return v

    @model_validator(mode="after")
    def _extension_matches_type(self) -> "CanonicalLeaseMetadata":
        if self.extension.kind != self.property_type.value:
            raise ValueError(
                f"extension '{self.extension.kind}' does not match "
                f"property_type '{self.property_type.value}'"
            )
        return self


class TypeAwareNormalizer:
    def __init__(self, fallback_threshold: float = 0.75):
        self.fallback_threshold = fallback_threshold
        self.fallback_queue: list[dict] = []

    def normalize(self, raw: dict) -> Union[CanonicalLeaseMetadata, dict]:
        try:
            prop_type = PropertyType(raw.get("property_type", "").strip().lower())
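            # A missing score counts as 0.0 so the record routes to review
            # instead of passing the threshold on an assumed default.
            confidence = float(raw.get("confidence", 0.0))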

            if confidence < self.fallback_threshold:
                return self._to_fallback(raw, "Low extraction confidence")

            ext_model = EXTENSION_BY_TYPE[prop_type]
            record = CanonicalLeaseMetadata(
                lease_id=raw["lease_id"],
                property_type=prop_type,
                rent=RentStructure(**raw["rent"]),
                extension=ext_model(**raw.get("extension", {})),
                extraction_confidence=confidence,
            )
            logger.info("Normalized %s (%s)", record.lease_id, prop_type.value)
            return record

        except Exception as exc:  # noqa: BLE001 — quarantine, never crash the batch
            return self._to_fallback(raw, str(exc))

    def _to_fallback(self, raw: dict, reason: str) -> dict:
        self.fallback_queue.append({"lease_id": raw.get("lease_id"), "reason": reason})
        logger.warning("Routed %s to fallback: %s", raw.get("lease_id"), reason)
        return {"status": "queued_for_review", "lease_id": raw.get("lease_id"), "reason": reason}

The discriminated union is what makes this safe: when a record is validated from raw or serialized data, pydantic uses the kind tag to pick exactly one extension model, so an office payload cannot validate as a retail extension, and the model_validator rejects any record whose extension disagrees with its declared property type. Common fields live once on the base; type variance lives in exactly one place. A new asset class is one new extension model, registered through a PropertyType value, a member of the Extension union, and an EXTENSION_BY_TYPE entry; the base fields and every existing type stay frozen.

Edge cases specific to commercial leases

Mixed portfolios surface failure modes a single-asset pipeline never meets:

Mixed-use and reclassified assets. A ground-floor retail unit inside a multifamily tower carries both a percentage-rent breakpoint and a unit count. Do not invent a fifth enum value per hybrid — model the dominant rent mechanic as the property_type and let the secondary attributes ride in a clearly namespaced sub-block, or the extension matrix combinatorially explodes.
Frequency drift across types. Office and industrial rents are usually quoted annually, multifamily monthly. Normalize to a single canonical frequency at the boundary and store the original, because an annual office figure silently compared against a monthly multifamily one inflates a portfolio roll-up twelve-fold.
Null versus zero in type-specific fields. A retail percentage_rent_rate of 0 is a negotiated term meaning no percentage rent is owed; None means the clause was unreadable. Collapse them and a genuine zero-rate lease is sent down fallback routing logic and overwritten with a portfolio default.
Amendment riders that change asset behavior. A tracked-changes rider can convert a flat-rent suite into a percentage-rent one mid-term. Resolve the governing version by effective date in the lease data models layer before selecting the extension, never by filename order, or the wrong extension model validates clean against stale terms.
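
Three of these edge cases reduce to small, testable helpers. The sketch below is illustrative and uses only the standard library: `to_annual`, `percentage_rent_status`, and `governing_version` are hypothetical names, annual is assumed as the canonical frequency, and the version dicts (with their `file`, `effective_date`, and `rent_mechanic` keys) are made-up sample data.

```python
from datetime import date
from decimal import Decimal
from typing import Optional


def to_annual(amount: Decimal, frequency: str) -> dict:
    """Normalize to an annual figure at the boundary and keep the original quote."""
    if frequency not in ("monthly", "annual"):
        raise ValueError(f"unsupported frequency: {frequency!r}")
    factor = 12 if frequency == "monthly" else 1
    return {
        "annual_amount": amount * factor,
        "original_amount": amount,
        "original_frequency": frequency,
    }


def percentage_rent_status(rate: Optional[Decimal]) -> str:
    """Keep None (unreadable clause) distinct from 0 (negotiated zero rate)."""
    if rate is None:
        return "needs_review"
    if rate == 0:
        return "no_percentage_rent"
    return "percentage_rent_applies"


def governing_version(versions: list[dict], as_of: date) -> dict:
    """Pick the latest version already in effect on `as_of`, ignoring filename order."""
    in_effect = [v for v in versions if v["effective_date"] <= as_of]
    if not in_effect:
        raise ValueError("no lease version in effect on the requested date")
    return max(in_effect, key=lambda v: v["effective_date"])


versions = [
    {"file": "rider_02.pdf", "effective_date": date(2023, 1, 1), "rent_mechanic": "flat"},
    {"file": "rider_01.pdf", "effective_date": date(2024, 7, 1), "rent_mechanic": "percentage"},
]
print(to_annual(Decimal("2500"), "monthly")["annual_amount"])            # 30000
print(percentage_rent_status(Decimal("0")), percentage_rent_status(None))
print(governing_version(versions, date(2024, 12, 31))["rent_mechanic"])  # percentage
```

Note that the sample filenames sort in the opposite order to their effective dates, which is exactly the situation where filename ordering would select stale terms.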

When to escalate instead of normalize

The dispatcher should stop and route to a human queue, rather than pick a default, whenever the property type itself is uncertain or the type-specific value carries direct financial impact. Concretely, escalate when: (1) the property-type classifier scores below threshold, because the wrong type selects the wrong extension and every field beneath it is then misinterpreted; (2) the asset is genuinely mixed-use and no single rent mechanic dominates; or (3) a type-specific money field (a retail breakpoint, an office CAM method, a multifamily gross potential rent) arrives malformed or below its per-field confidence threshold. When the low-confidence value traces to a scanned or shifted rider, the fix is upstream: tag it for OCR preprocessing re-extraction rather than substituting a default, so a fixable extraction problem is not masked. Set the fallback threshold at 0.90 or higher on money fields, well above the 0.75 record-level default in the dispatcher above, so they reach review eagerly, and externalize the type-dispatch matrix to configuration so ops can adjust routing without a redeploy.
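
The escalation rules above can be sketched as a config-driven check. Everything named here is an assumption made for illustration: `ROUTING_CONFIG`, `escalation_reason`, and the record keys `type_confidence`, `field_confidence`, and `mixed_use_no_dominant_mechanic` do not appear in the dispatcher, which tracks only a single record-level confidence. In practice the config dict would be loaded from an external file rather than defined inline.

```python
from typing import Optional

# Hypothetical routing config; values are illustrative and would normally be
# loaded from an external JSON or YAML file so ops can change them without a redeploy.
ROUTING_CONFIG = {
    "property_type_threshold": 0.90,
    "money_field_threshold": 0.90,
    "money_fields": {
        "retail": ["breakpoint_sales_threshold"],
        "office": ["cam_reconciliation_method"],
        "multifamily": ["gross_potential_rent"],
    },
}


def escalation_reason(record: dict, config: dict = ROUTING_CONFIG) -> Optional[str]:
    """Return why a record needs human review, or None if it may be normalized."""
    if record.get("type_confidence", 0.0) < config["property_type_threshold"]:
        return "property type below confidence threshold"
    if record.get("mixed_use_no_dominant_mechanic"):
        return "mixed-use asset with no dominant rent mechanic"
    field_conf = record.get("field_confidence", {})
    for name in config["money_fields"].get(record["property_type"], []):
        # A missing score counts as 0.0, so an unscored money field escalates.
        if field_conf.get(name, 0.0) < config["money_field_threshold"]:
            return f"money field '{name}' missing or below confidence threshold"
    return None


record = {
    "property_type": "retail",
    "type_confidence": 0.97,
    "field_confidence": {"breakpoint_sales_threshold": 0.82},
}
print(escalation_reason(record))
```

Here the type is trusted but the breakpoint is not, so the record goes to review with a reason a human can act on instead of being normalized with a guessed value.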

Frequently asked questions

Should I use one flat schema or a schema per property type?

Neither, for a mixed portfolio. Use a canonical base for the fields every lease shares plus one validated extension model per property type. The base keeps portfolio-level queries simple; the extension preserves type-specific fidelity. A flat schema invites silent semantic drift, and N separate schemas force N-way unions on every cross-asset report.

How do I add a new asset class without breaking downstream consumers?

Add one new typed extension model and one entry in the dispatch map. Because the canonical base never changes, existing consumers keep reading the same shared fields, and only systems that care about the new asset class read its extension. This is the whole reason to isolate type variance in a discriminated extensions namespace.

What confidence threshold should trigger manual review here?

Tie it to financial impact. The property-type classifier and any type-specific money field — retail breakpoint, office CAM method, multifamily gross potential rent — should sit at 0.90 or higher so they escalate eagerly, because a wrong type selects the wrong extension and misreads every field below it. Descriptive fields can sit lower and fall back to a default.

How do I handle a mixed-use lease that spans two property types?

Model the dominant rent mechanic as the property_type and carry the secondary attributes in a clearly namespaced sub-block rather than minting a hybrid enum value. If no mechanic dominates, route the record to manual review — combinatorial hybrid types make the extension matrix unmaintainable.

Metadata Normalization Standards — the parent stage defining the canonical schema, coercion layer, and validation gate this dispatcher plugs into.
Fallback Routing for Missing Lease Metadata Fields — what happens to the records this normalizer quarantines as low-confidence.
Handling CAM Charge Variations in Lease Taxonomy Design — the office and industrial reconciliation variance that the typed extensions isolate.
Structuring a Lease Abstraction Database for Multi-Property Portfolios — how the base-plus-extensions payload maps onto storage without flattening type fidelity.
