Why AVMs Are a Prior, Not a Valuation

Property valuation is one of the most consequential decisions in finance. Every mortgage, every refinancing, every fund NAV, every disposal, every loan workout sits on top of a number. Get the number wrong and the consequences travel — into capital provisioning, regulatory ratios, investor returns, and individual borrowers. Get it right and the rest of the system holds.

For the last twenty-five years, the industry has been moving the production of that number from humans to machines. Automated Valuation Models — AVMs — now sit at the heart of mortgage origination, portfolio monitoring, and securitisation in nearly every major property market. Hometrack values around 50 million UK properties a year and is used by 17 of the top 20 UK lenders. Zillow publishes Zestimates on 118 million US homes. CoreLogic, HouseCanary, ICE, and First American all run their own models for institutional clients. The economic case is straightforward: an AVM costs cents; a surveyor inspection costs hundreds.

But there is a structural problem hiding inside the AVM industry that the next decade of valuation will need to solve.

AVMs are evaluated on aggregate statistics. They are used on individual assets. Those are not the same job.

This piece sets out why — and what a different system, built on top of AVMs rather than against them, looks like.

1. A short history of how AVMs got here

Modern AVMs descend from the hedonic pricing model, formalised by Sherwin Rosen in 1974, building on earlier work by Court (1939) and Lancaster (1966). The core idea is that a property is a bundle of characteristics — size, location, age, bedrooms, condition — and its price is the sum of the implicit prices of each characteristic. Run an ordinary least squares regression of sale price on those characteristics across a market and you have a valuation engine.
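As a minimal sketch of that idea, the snippet below fits an ordinary least squares regression to a handful of invented sales and reads the coefficients as implicit prices. The figures, the three chosen characteristics, and the helper names (`solve`, `fit_ols`, `predict`) are illustrative assumptions, not any vendor's model. A production system would use a numerical library rather than a hand-written solver.

```python
# Toy hedonic regression: price = b0 + b1*area + b2*bedrooms + b3*age.
# All figures are invented for illustration; no real market data is used.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, prices):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, prices)) for i in range(k)]
    return solve(xtx, xty)

def predict(beta, features):
    """Value = intercept + sum of (implicit price * characteristic)."""
    return beta[0] + sum(b * f for b, f in zip(beta[1:], features))

# (floor area in sqm, bedrooms, age in years) and the observed sale price
sales = [
    ((55, 1, 40), 207_000), ((70, 2, 25), 265_000), ((85, 2, 60), 298_000),
    ((100, 3, 10), 372_000), ((120, 3, 35), 425_000), ((150, 4, 5), 540_000),
    ((65, 2, 80), 223_000), ((90, 3, 50), 327_000),
]
beta = fit_ols([features for features, _ in sales], [price for _, price in sales])

labels = ["intercept", "per sqm", "per bedroom", "per year of age"]
for label, coefficient in zip(labels, beta):
    print(f"{label}: {coefficient:,.0f}")
print(f"estimate for an 80 sqm, 2-bed, 30-year-old flat: {predict(beta, (80, 2, 30)):,.0f}")
```

Each fitted coefficient is an average implicit price across the whole sample, which is the property the rest of this piece turns on.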

For thirty years, that was the state of the art. Then the data got big and the methods got better.

The literature documents a clear evolution:

Linear hedonic models (1970s–1990s): transparent, interpretable, but assume linearity and additivity in ways that are demonstrably wrong for housing.
Spatial models (2000s): geographically weighted regression and spatial autoregression to capture the fact that location effects don't respect arbitrary boundaries.
Tree-based ensembles (2010s): random forests, gradient boosting, XGBoost. Capture non-linear interactions between hundreds of features. Significantly outperform linear hedonic models on most benchmarks.
Neural networks and deep learning (late 2010s onwards): Zillow's Zestimate now runs on a neural network examining hundreds of data points per home. CoreLogic and HouseCanary use ensemble stacks combining multiple model families.
Multi-modal models (2020s): incorporating computer vision over property photos, satellite imagery, and street-level imagery. Recent work from Santiago and Hong Kong shows image-based features can meaningfully reduce error.

Each generation has been an honest improvement. AVMs today are far more accurate than they were a decade ago. Zillow's reported median error for active listings has dropped from roughly 4–5% in 2020 to under 2% on-market in 2026.

This is not a piece arguing that AVMs are bad. They aren't. They are the most efficient valuation infrastructure in the history of finance.

The argument is about what they fundamentally can and can't do — and where the next layer of value sits.

2. The benchmarking problem: average error vs single-asset decisions

Every AVM in the market is graded on aggregate statistics:

Zillow reports median error rates: 1.9% on-market, ~7% off-market, with regional variation up to 9.8%.
Hometrack reports the percentage of valuations falling within 10% of a surveyor's recommended value.
CoreLogic, HouseCanary, ICE, and First American all report variations on the same theme: MAPE (mean absolute percentage error), median error, percentage within a band, hit rate, and confidence-tier accuracy.

These are useful metrics for comparing models. They are not useful metrics for trusting any individual valuation.
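To make the distinction concrete, here is a small sketch that computes the usual scorecard for an invented ten-property portfolio. The function name `avm_scorecard`, the 10% band, and every price are assumptions chosen for illustration.

```python
from statistics import median

def avm_scorecard(estimates, sale_prices, band=0.10):
    """Aggregate accuracy metrics of the kind AVM providers publish."""
    errors = [abs(e - p) / p for e, p in zip(estimates, sale_prices)]
    return {
        "mape": sum(errors) / len(errors),          # mean absolute percentage error
        "median_ape": median(errors),               # median absolute percentage error
        "within_band": sum(err <= band for err in errors) / len(errors),
        "worst_ape": max(errors),                   # rarely reported, but decisive for one asset
    }

# Invented portfolio: nine typical homes valued within 2%,
# plus one unusual asset where the estimate is badly wrong.
sale_prices = [300_000] * 9 + [500_000]
estimates = [306_000] * 9 + [300_000]

for name, value in avm_scorecard(estimates, sale_prices).items():
    print(f"{name}: {value:.1%}")
```

In this toy portfolio the median error is 2% and 90% of estimates fall within the 10% band, yet one asset is off by 40%. The headline figures describe the population accurately and say nothing useful about that one property.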

The academic literature has been explicit about this for years. Demiroglu and James, in a study of US non-agency securitised loans, found that AVMs showed pricing errors of 12% to 15% of actual sale price for median-quality homes, with even higher variability for properties below median quality. Kruger and Maturana found that AVM errors and appraisal errors come from fundamentally different sources: AVM errors are statistical errors and model miscalibration, while appraisal errors are more often moral hazard and behavioural.

Two findings from the last two years sharpen this further:

The "springiness" problem. The American Enterprise Institute's 2024 evaluation of five major US AVM providers found that several models were "springing" toward the listing price when one became available — meaning their tested accuracy in benchmarking exercises was inflated by access to a signal that doesn't exist in real-world use cases like home equity lending or refinancing. AVMetrics, the only independent AVM testing firm in the US, has now stopped allowing list prices in test methodologies for exactly this reason. AEI's combined scores across the five providers ranged from a B (4.1) to a D+ (2.4) — for the same five major institutional AVMs lenders rely on every day.

The unobservability problem. A 2026 paper by Yiu et al. studying the Auckland market introduced a more fundamental point: divergence between AVM estimates and transaction prices is not always valuation error. It can reflect buyer-side heterogeneity that cannot be learned, predicted, or generalised by design. Some of what determines a transaction price — the specific buyer, the specific moment, the specific negotiation — is fundamentally outside what any model can learn from past transactions.

The takeaway: the headline accuracy of an AVM tells you about the population. It tells you almost nothing about your asset.

AVMs hold up in the body of the distribution. The nuance lives in the tail.

And in real estate, the tail is where the decisions get made.

3. The deeper problem: statistical learning vs valuation reasoning

A defender of modern AVMs would say, fairly: "We have hundreds of features. We use gradient boosting and neural networks that learn non-linear interactions. We capture lease length, condition, location effects. The model has seen short leases and restrictive covenants in training."

That's true. And it misses the point.

There are three structural limits that no amount of model sophistication can overcome:

3.1 Statistical learning learns average effects, not specific causal structures

A model trained on millions of transactions can learn that on average, a short lease reduces price by X%. It cannot learn that, for this specific flat, in this specific building, with this specific freeholder, with this specific service charge dispute, the lease is the binding constraint on value today.

The model sees a feature in a row of a table. A valuer sees a situation. The same lease length can mean very different things in different contexts — and the contextual factors are usually not in the training data, by definition.

3.2 Most of what determines value isn't in the data

The standard AVM training set contains structured fields: square footage, bedrooms, transactions, location, age. The standard valuation question — "what is this asset actually worth, today, to a real buyer?" — depends on facts that live elsewhere:

A planning constraint sits in a PDF on a council website.
A restrictive covenant sits in a title deed.
A failed listing six months ago is a behavioural signal, not a price.
"Flat 12 to 14" in a marketing brochure is a fact about a building's physical configuration, not a row in a table.
A new lease extension was granted last week and hasn't propagated to any database.
A neighbouring development has just received planning permission that materially changes the view.

Each of these is recoverable — but only by reading unstructured evidence, interpreting what it means, and connecting it to the specific asset. None of it is in the model's feature vector.

3.3 Statistical models cannot reason about their own inputs

This is the deepest limit, and the one most AVM literature glosses over.

Even if a model had every relevant feature, it has no concept of "this input looks suspicious — I should verify it before relying on it." Checking requires a theory of the asset that exists outside the data. A surveyor checks because they know what an 80 sqm flat should look like, what a building of this age and type should contain, what the typical service charge in this postcode should be. When something doesn't fit, they investigate.

A statistical model has no such theory. It has a fitted surface through nearby points in feature space. If the inputs are wrong, the output is wrong, and the confidence band stays just as tight. The model has no mechanism for self-doubt.

This is not a complaint about model architecture. It is a fundamental property of supervised learning. You cannot train a regression to do something that isn't represented in its loss function — and "is this input internally consistent with the rest of what I know about this asset?" is not a regression problem.
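A toy illustration of the confidence-band point, under deliberately simple assumptions: a one-feature linear model fitted to invented sales, with a single band width derived from training residuals. Real AVMs calibrate confidence in more elaborate ways, so treat this as a sketch of the mechanism rather than a description of any product.

```python
from statistics import mean, stdev

# Invented training sales: floor area in sqm and sale price.
areas = [50, 60, 70, 80, 90, 100, 110, 120]
prices = [212_000, 238_000, 262_000, 291_000, 318_000, 339_000, 371_000, 389_000]

# Simple least squares fit of price on floor area.
mean_area, mean_price = mean(areas), mean(prices)
slope = sum((a - mean_area) * (p - mean_price) for a, p in zip(areas, prices)) / sum(
    (a - mean_area) ** 2 for a in areas
)
intercept = mean_price - slope * mean_area

# Simplifying assumption: one band width for every prediction,
# taken from the spread of the training residuals.
residuals = [p - (intercept + slope * a) for a, p in zip(areas, prices)]
band = 1.96 * stdev(residuals)

def value_with_band(area_sqm):
    """Return (low, point, high) for a given recorded floor area."""
    point = intercept + slope * area_sqm
    return point - band, point, point + band

true_record = value_with_band(80)   # floor area as measured on site
bad_record = value_with_band(108)   # same flat, floor area mis-keyed in the source data

for label, (low, point, high) in [("correct input", true_record), ("wrong input", bad_record)]:
    print(f"{label}: {point:,.0f} (band {low:,.0f} to {high:,.0f})")
```

The mis-keyed record produces a higher estimate with a band exactly as wide as the correct one. Nothing in the fitted model asks whether 108 sqm is plausible for this flat.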

4. What surveyors actually do — and what the RICS Red Book already says

Strip away the romance and a surveyor's job has a clear structure:

Establish the facts of the asset — physical, legal, and contextual — through inspection and documentary evidence.
Apply multiple valuation methods (comparables, investment, residual, depreciated replacement cost) and form a view on which is appropriate and how they relate.
Reconcile disagreement between methods and between data sources.
Apply professional judgement to adjust for what the standard methods miss.
Document the reasoning so it can be challenged, audited, and defended.

This is not memorisation of prices. It is reasoning about an asset.

The regulatory framework already acknowledges this. The RICS Red Book (Global Standards, January 2025 edition) is explicit:

Where artificial intelligence (AI), automated valuation models (AVMs) or valuation calculation software is used, the outputs are only considered to be a written valuation if a valuer has applied their professional judgement to it.

Section VPS 1 of the Red Book, and the corresponding sections of International Valuation Standards (IVS 105.20), position AVMs as inputs to professional judgement rather than substitutes for it. The valuer remains responsible for the opinion. The AVM provides evidence and analysis the valuer can accept, adjust, or override.

This is the regulatory consensus, globally. The 2024 US federal final rule on AVMs — issued by six agencies including the Federal Reserve, OCC, FDIC, and CFPB and effective October 2025 — requires lenders using AVMs to maintain documented policies, vendor oversight, testing, and bias controls. The direction of travel is the same: AVMs are useful, but the system around them needs to be more rigorous, not less.

The industry has already conceded the structural point. AVMs alone are not valuations. They are inputs to valuations.

The question is: what does the layer above the AVM look like?

For most of the last twenty-five years, that layer has had to be a human. Until now.

5. Why LLMs change this — for the first time

Until very recently, the only systems capable of doing what surveyors do were surveyors. The components of valuation reasoning — reading unstructured documents, cross-referencing multiple sources, identifying contradictions, applying domain rules, generating auditable explanations — were beyond what any tractable software system could do.

Large language models change this, and the change is not cosmetic.

The capabilities that matter for valuation are:

Reading unstructured evidence at scale. Planning decisions, title deeds, marketing brochures, listing histories, valuation reports, lease documents, EPC certificates. All the things that contain the actual facts of an asset and have always been outside an AVM's reach.
Multi-step reasoning with verification. A growing academic literature (Graph of Verification, VerifiAgent, Chain-of-Verification) is formalising how LLMs can decompose complex tasks, retrieve evidence, check intermediate results, and revise. This is exactly what valuation reasoning requires.
Tool use and structured outputs. An LLM agent can call an AVM as one tool, a comparables database as another, a planning system as a third — and synthesise the results, including handling disagreement.
Auditable working. Unlike a regression's confidence band, an agent can produce a written reasoning trace: which sources were consulted, which were trusted, which were demoted, and why. This is what the RICS Red Book has always required from human valuers and what statistical AVMs have never been able to provide.

This isn't theoretical. Multi-agent reasoning systems are now being applied seriously in clinical diagnosis, financial analysis, and real estate transactions. Beike's REAL benchmark (2025) is the first systematic evaluation of LLMs on housing transaction tasks. Recent work on REIT trading agents demonstrates closed-loop multi-agent systems making investment decisions. The infrastructure exists, the academic foundation is being laid, and the engineering is tractable.

The relevant claim is not that LLMs are smarter than statistical models. They aren't, on most prediction tasks. The relevant claim is that they can do the part of valuation that statistical models structurally cannot: reason about a specific asset.

6. AVMs as a prior, agents as the valuer

The architecture this points to is straightforward:

AVMs remain. They become priors, not answers.

The statistical model has compressed millions of transactions into a baseline expectation for a given asset. That is genuinely valuable — it is the strongest single starting point in the valuation. But it is a starting point, not a conclusion.

An agent does the work above the prior. Specifically:

Pulls evidence the AVM couldn't see — title, planning, lease data, recent transactions, listing behaviour.
Reads behavioural signals — failed sales, relistings, withdrawals, time on market.
Cross-checks the AVM's inputs against other sources — does the floor area on the AVM match the floor area on the EPC, the marketing material, and the building plans?
Runs multiple valuation methods in parallel — comparables, investment yield, residual — and explicitly handles disagreement instead of averaging it away.
Adjusts the prior for what the standard data couldn't see and what the model couldn't reason about.
Produces a written reasoning trace that a human can challenge, audit, and override.
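Two of the steps above, the input cross-check and the handling of method disagreement, can be sketched in a few lines. This covers only the deterministic bookkeeping around the prior. It makes no language-model calls, and the names (`Review`, `cross_check_floor_area`, `compare_methods`), the 5% and 10% thresholds, and all figures are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Review:
    prior: float                                  # AVM estimate, used as the starting point
    flags: list = field(default_factory=list)     # issues a valuer should look at
    trace: list = field(default_factory=list)     # written record of what was checked

def cross_check_floor_area(review, avm_area, other_sources, tolerance=0.05):
    """Compare the floor area the AVM used against each independent source."""
    for source, area in other_sources.items():
        gap = abs(area - avm_area) / avm_area
        if gap > tolerance:
            review.flags.append(f"floor area mismatch: {source}")
            review.trace.append(
                f"{source} records {area} sqm vs {avm_area} sqm in the AVM input ({gap:.0%} gap)"
            )
        else:
            review.trace.append(f"{source} agrees with the AVM floor area")

def compare_methods(review, method_values, max_spread=0.10):
    """Record disagreement between valuation methods instead of averaging it away."""
    values = {"avm_prior": review.prior, **method_values}
    low, high = min(values.values()), max(values.values())
    spread = (high - low) / low
    review.trace.append(f"method range {low:,.0f} to {high:,.0f} ({spread:.0%} spread)")
    if spread > max_spread:
        review.flags.append("methods disagree: refer to valuer")
    return spread

# Invented example: the AVM was fed 95 sqm, but two documents say roughly 80 sqm.
review = Review(prior=400_000)
cross_check_floor_area(review, avm_area=95, other_sources={"EPC": 80, "marketing brochure": 82})
compare_methods(review, {"comparables": 352_000, "investment_yield": 340_000})

print("\n".join(review.trace))
print(review.flags)
```

The output is a list of flags and a trace rather than a single adjusted number, which keeps the final judgement with the person who can challenge or override it.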

This is, structurally, what a chartered surveyor does. The agent doesn't replace the surveyor — it does the data-heavy, mechanical, evidentiary work that consumes most of a surveyor's time on each instruction, allowing the human judgement to focus where it actually matters.

7. What this means for the industry

Three implications follow.

For lenders. The compliance environment is converging on the view that AVMs alone are insufficient — both the new US federal rule and the RICS Red Book require professional judgement above the model. Today, that judgement has to come from a human surveyor, which is expensive and slow. An agent layer between the AVM and the surveyor can dramatically reduce the per-decision cost while improving the quality of evidence the surveyor sees. The economics flip from "AVM or surveyor" to "AVM, agent, and surveyor in sequence."

For surveyors. Valuation is not under threat from AI. The mechanical parts of the job are. The surveyor of the next decade spends less time gathering comparables and more time exercising the judgement that the regulatory framework already says is the irreducible core of the work. The professional opinion remains the product. The cost structure of producing it changes.

For fund underwriters and asset managers. The current state is a binary: full appraisal (slow, expensive, comprehensive) or AVM (fast, cheap, blunt). The middle ground — fast, cheap, and asset-specific — has not existed. This is the gap an agentic valuation layer fills, and it materially changes underwriting workflows for portfolio acquisition, refinancing, and ongoing monitoring.

8. Conclusion

The next decade of property valuation isn't a better model.

Statistical learning, in all its forms — hedonic regression, gradient boosting, neural networks, multi-modal ensembles — has been getting better for fifty years. It will continue to get better. Median errors will keep falling. New data sources will keep arriving. None of that will change the structural mismatch.

AVMs are graded on aggregate statistics. Decisions get made on individual assets.

ML learns the average asset. Valuation is about the specific asset.

The gap between them is what surveyors have always been paid for — and it's the gap LLMs can finally close.

AVMs are not dead. They are not even diminished. They are exactly what they have always been: the most efficient way ever invented to compute a baseline expectation for a property's value. They are a strong prior.

But the valuer is, and has always been, the layer of reasoning that sits above the prior. For twenty-five years, that layer has been a human. For the next twenty-five, it will be a system that combines the statistical baseline with the kind of asset-specific reasoning that only humans could previously do.

That is the shift the industry hasn't priced in yet.

AVMs give you a number. The agent tells you whether to trust it.

Written by the team at UrbanCode, where we are building the agentic valuation layer for global property markets. If you are in lending, surveying, or fund underwriting and this resonates, we would like to hear from you.

Selected references

Rosen, S. (1974). Hedonic Prices and Implicit Markets. Journal of Political Economy.
Demiroglu, C. & James, C. AVM pricing errors and appraisal bias. Cited in Ignoring Spatial and Spatiotemporal Dependence in the Disturbances Can Make Black Swans Appear Grey (NCBI / PMC8011052).
Kruger & Maturana. Sources of error in AVMs vs appraisals. Cited in same.
Yiu et al. (2026). Why Market Prices May Not Be the Best Benchmark for Automated Valuation Models. MDPI International Journal of Financial Studies.
AEI Housing Center (2024). Results on the Evaluation of AVM Providers. American Enterprise Institute.
AVMetrics (2024–2025). Whitepapers on AVM testing methodology and "springiness".
RICS (2025). Valuation – Global Standards (Red Book), January 2025 edition.
US Federal Agencies (2024). Final Rule on Quality Control Standards for Automated Valuation Models. Effective October 2025.
Hong Kong / Santiago / Auckland comparative AVM studies (PLOS One, 2025).
Beike Inc. (2025). REAL: Benchmarking Abilities of Large Language Models for Housing Transactions and Services.
Liang et al. (2024). Chain-of-Verification.
Hong et al. (2024). Detecting Logical Errors in LLMs.
Stechly et al. (2024). Self-Critique in LLM Reasoning.

