The structural limits of statistical valuation, and what comes next

Property valuation is one of the most consequential decisions in finance. Every mortgage, every refinancing, every fund NAV, every disposal, every loan workout sits on top of a number. Get the number wrong and the consequences travel — into capital provisioning, regulatory ratios, investor returns, and individual borrowers. Get it right and the rest of the system holds.
For the last twenty-five years, the industry has been moving the production of that number from humans to machines. Automated Valuation Models — AVMs — now sit at the heart of mortgage origination, portfolio monitoring, and securitisation in nearly every major property market. Hometrack values around 50 million UK properties a year and is used by 17 of the top 20 UK lenders. Zillow publishes Zestimates on 118 million US homes. CoreLogic, HouseCanary, ICE, and First American all run their own models for institutional clients. The economic case is straightforward: an AVM costs cents; a surveyor inspection costs hundreds.
But there is a structural problem hiding inside the AVM industry that the next decade of valuation will need to solve.
AVMs are evaluated on aggregate statistics. They are used on individual assets. Those are not the same job.
This piece sets out why — and what a different system, built on top of AVMs rather than against them, looks like.
Modern AVMs descend from the hedonic pricing model, formalised by Sherwin Rosen in 1974, building on earlier work by Court (1939) and Lancaster (1966). The core idea is that a property is a bundle of characteristics — size, location, age, bedrooms, condition — and its price is the sum of the implicit prices of each characteristic. Run an ordinary least squares regression of sale price on those characteristics across a market and you have a valuation engine.
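To make the hedonic idea concrete, here is a minimal sketch of an ordinary least squares fit using only the Python standard library. The data is synthetic and generated from a made-up pricing rule, and the function names (fit_hedonic, predict, solve_linear_system) are illustrative assumptions rather than any vendor's implementation. Production AVMs use far richer features and methods.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; coefficients are not identified")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_hedonic(features: Sequence[Sequence[float]], prices: Sequence[float]) -> List[float]:
    """OLS via the normal equations (X'X) beta = X'y, with an intercept term."""
    X = [[1.0] + list(row) for row in features]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, prices)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(beta: Sequence[float], row: Sequence[float]) -> float:
    """Price = intercept + sum of (implicit price * characteristic)."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Synthetic sales: (floor area in sqm, bedrooms, building age in years).
features = [(50, 1, 10), (70, 2, 30), (90, 3, 5), (60, 2, 50), (120, 4, 20), (80, 2, 15)]
# Hypothetical "true" implicit prices, used only to generate the toy data.
prices = [50_000 + 3_000 * s + 20_000 * b - 500 * a for s, b, a in features]

beta = fit_hedonic(features, prices)
estimate = predict(beta, (100, 3, 10))
```

Because the toy prices follow the generating rule exactly, the fitted coefficients recover the implicit prices per square metre, per bedroom, and per year of age. Real transaction data is noisy, which is where the error statistics discussed later come from.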
For thirty years, that was the state of the art. Then the data got big and the methods got better.
The literature documents a clear evolution:
Each generation has been an honest improvement. AVMs today are far more accurate than they were a decade ago. Zillow's reported median error for on-market homes has dropped from roughly 4–5% in 2020 to under 2% in 2026.
This is not a piece arguing that AVMs are bad. They aren't. They are the most efficient valuation infrastructure in the history of finance.
The argument is about what they fundamentally can and can't do — and where the next layer of value sits.
Every AVM in the market is graded on aggregate statistics:
These are useful metrics for comparing models. They are not useful metrics for trusting any individual valuation.
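A small sketch shows how this plays out. It computes two commonly used aggregate measures, the median absolute percentage error and the share of estimates within 10% of sale price, on a synthetic portfolio. All figures are invented for illustration, and the names (abs_pct_errors, summarize) are assumptions, not any provider's published methodology.

```python
from statistics import median
from typing import Dict, List, Sequence


def abs_pct_errors(estimates: Sequence[float], sale_prices: Sequence[float]) -> List[float]:
    """Absolute error of each estimate as a fraction of the sale price."""
    return [abs(e - s) / s for e, s in zip(estimates, sale_prices)]


def summarize(estimates: Sequence[float], sale_prices: Sequence[float]) -> Dict[str, float]:
    """Aggregate accuracy statistics plus the single worst miss."""
    errs = abs_pct_errors(estimates, sale_prices)
    return {
        "mdape": median(errs),                               # median absolute % error
        "ppe10": sum(e <= 0.10 for e in errs) / len(errs),   # share within +/-10%
        "worst": max(errs),                                  # largest individual error
    }


# Ten synthetic properties that each sold for 200,000.
sale_prices = [200_000] * 10
# Nine estimates are close; one is badly wrong.
estimates = [202_000, 198_000, 204_000, 196_000, 201_000,
             199_000, 203_000, 197_000, 210_000, 270_000]

stats = summarize(estimates, sale_prices)
```

On this toy portfolio the median error is 1.5% and 90% of estimates fall within 10%, yet one property is misvalued by 35%. The headline statistics are identical whether or not that property is the one a lender is about to advance funds against.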
The academic literature has been explicit about this for years. Demiroglu and James, in a study of US non-agency securitised loans, found AVMs showed pricing errors of 12% to 15% of actual sale price for median-quality homes, with even higher variability for properties below median quality. Kruger and Maturana found that AVM errors and appraisal errors come from fundamentally different sources: AVM errors are statistical in nature and reflect model miscalibration, while appraisal errors more often stem from moral hazard and behavioural factors.
Two findings from the last two years sharpen this further:
The "springiness" problem. The American Enterprise Institute's 2024 evaluation of five major US AVM providers found that several models were "springing" toward the listing price when one became available — meaning their tested accuracy in benchmarking exercises was inflated by access to a signal that doesn't exist in real-world use cases like home equity lending or refinancing. AVMetrics, the only independent AVM testing firm in the US, has now stopped allowing list prices in test methodologies for exactly this reason. AEI's combined scores across the five providers ranged from a B (4.1) to a D+ (2.4) — for the same five major institutional AVMs lenders rely on every day.
The unobservability problem. A 2026 paper by Yiu et al. studying the Auckland market introduced a more fundamental point: divergence between AVM estimates and transaction prices is not always valuation error. It can reflect buyer-side heterogeneity that cannot be learned, predicted, or generalised by design. Some of what determines a transaction price — the specific buyer, the specific moment, the specific negotiation — is fundamentally outside what any model can learn from past transactions.
The takeaway: the headline accuracy of an AVM tells you about the population. It tells you almost nothing about your asset.
AVMs hold up in the body of the distribution. The nuance lives in the tail.
And in real estate, the tail is where the decisions get made.
A defender of modern AVMs would say, fairly: "We have hundreds of features. We use gradient boosting and neural networks that learn non-linear interactions. We capture lease length, condition, location effects. The model has seen short leases and restrictive covenants in training."
That's true. And it misses the point.
There are three structural limits that no amount of model sophistication can overcome:
A model trained on millions of transactions can learn that on average, a short lease reduces price by X%. It cannot learn that, for this specific flat, in this specific building, with this specific freeholder, with this specific service charge dispute, the lease is the binding constraint on value today.
The model sees a feature in a row of a table. A valuer sees a situation. The same lease length can mean very different things in different contexts — and the contextual factors are usually not in the training data, by definition.
The standard AVM training set contains structured fields: square footage, bedrooms, transactions, location, age. The standard valuation question — "what is this asset actually worth, today, to a real buyer?" — depends on facts that live elsewhere:
Each of these is recoverable — but only by reading unstructured evidence, interpreting what it means, and connecting it to the specific asset. None of it is in the model's feature vector.
This is the deepest limit, and the one most AVM literature glosses over.
Even if a model had every relevant feature, it has no concept of "this input looks suspicious — I should verify it before relying on it." Checking requires a theory of the asset that exists outside the data. A surveyor checks because they know what an 80 sqm flat should look like, what a building of this age and type should contain, what the typical service charge in this postcode should be. When something doesn't fit, they investigate.
A statistical model has no such theory. It has a fitted surface through nearby points in feature space. If the inputs are wrong, the output is wrong, and the confidence band stays just as tight. The model has no mechanism for self-doubt.
This is not a complaint about model architecture. It is a fundamental property of supervised learning. You cannot train a regression to do something that isn't represented in its loss function — and "is this input internally consistent with the rest of what I know about this asset?" is not a regression problem.
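The kind of check a surveyor runs can be sketched as explicit rules that sit outside the regression. The example below is a deliberately simple illustration: the record fields and every threshold are hypothetical placeholders, not market norms, and a real system would draw its expectations from evidence about the building type and location.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AssetRecord:
    floor_area_sqm: float
    bedrooms: int
    annual_service_charge: float
    lease_years_remaining: float


# Hypothetical plausibility bounds, for illustration only.
MIN_SQM_PER_BEDROOM = 12.0
MAX_SQM_PER_BEDROOM = 80.0
MAX_SERVICE_CHARGE_PER_SQM = 100.0


def consistency_flags(rec: AssetRecord) -> List[str]:
    """Return human-readable reasons to verify the inputs before valuing."""
    flags: List[str] = []
    if rec.floor_area_sqm <= 0:
        # Nothing else can be checked sensibly without a valid floor area.
        return ["floor area must be positive"]
    if rec.bedrooms > 0:
        per_bed = rec.floor_area_sqm / rec.bedrooms
        if not MIN_SQM_PER_BEDROOM <= per_bed <= MAX_SQM_PER_BEDROOM:
            flags.append(f"{per_bed:.1f} sqm per bedroom is outside the expected range")
    charge_per_sqm = rec.annual_service_charge / rec.floor_area_sqm
    if charge_per_sqm > MAX_SERVICE_CHARGE_PER_SQM:
        flags.append(f"service charge of {charge_per_sqm:.0f} per sqm looks unusually high")
    if rec.lease_years_remaining < 0:
        flags.append("lease years remaining cannot be negative")
    return flags


plausible = AssetRecord(80.0, 2, 2_400.0, 95.0)
suspicious = AssetRecord(80.0, 12, 20_000.0, 95.0)

plausible_flags = consistency_flags(plausible)
suspicious_flags = consistency_flags(suspicious)
```

A fitted model would return a point estimate for both records with equal confidence. The rule layer returns nothing for the first record and two reasons to investigate the second, which is the behaviour a loss function does not produce on its own.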
Strip away the romance and a surveyor's job has a clear structure:
This is not memorisation of prices. It is reasoning about an asset.
The regulatory framework already acknowledges this. The RICS Red Book (Global Standards, January 2025 edition) is explicit:
Where artificial intelligence (AI), automated valuation models (AVMs) or valuation calculation software is used, the outputs are only considered to be a written valuation if a valuer has applied their professional judgement to it.
Section VPS 1 of the Red Book, and the corresponding sections of International Valuation Standards (IVS 105.20), position AVMs as inputs to professional judgement rather than substitutes for it. The valuer remains responsible for the opinion. The AVM provides evidence and analysis the valuer can accept, adjust, or override.
This is the regulatory consensus, globally. The 2024 US federal final rule on AVMs — issued by six agencies including the Federal Reserve, OCC, FDIC, and CFPB and effective October 2025 — requires lenders using AVMs to maintain documented policies, vendor oversight, testing, and bias controls. The direction of travel is the same: AVMs are useful, but the system around them needs to be more rigorous, not less.
The industry has already conceded the structural point. AVMs alone are not valuations. They are inputs to valuations.
The question is: what does the layer above the AVM look like?
For most of the last twenty-five years, that layer has had to be a human. Until now.
Until very recently, the only systems capable of doing what surveyors do were surveyors. The components of valuation reasoning — reading unstructured documents, cross-referencing multiple sources, identifying contradictions, applying domain rules, generating auditable explanations — were beyond what any tractable software system could do.
Large language models change this, and the change is not cosmetic.
The capabilities that matter for valuation are:
This isn't theoretical. Multi-agent reasoning systems are now being applied seriously in clinical diagnosis, financial analysis, and real estate transactions. Beike's REAL benchmark (2025) is the first systematic evaluation of LLMs on housing transaction tasks. Recent work on REIT trading agents demonstrates closed-loop multi-agent systems making investment decisions. The infrastructure exists, the academic foundation is being laid, and the engineering is tractable.
The relevant claim is not that LLMs are smarter than statistical models. They aren't, on most prediction tasks. The relevant claim is that they can do the part of valuation that statistical models structurally cannot: reason about a specific asset.
The architecture this points to is straightforward:
AVMs remain. They become priors, not answers.
The statistical model has compressed millions of transactions into a baseline expectation for a given asset. That is genuinely valuable — it is the strongest single starting point in the valuation. But it is a starting point, not a conclusion.
An agent does the work above the prior. Specifically:
This is, structurally, what a chartered surveyor does. The agent doesn't replace the surveyor — it does the data-heavy, mechanical, evidentiary work that consumes most of a surveyor's time on each instruction, allowing the human judgement to focus where it actually matters.
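One way to picture "AVM as prior" is a data structure in which the model output is the starting value and every departure from it carries a reason and a source. The sketch below is an assumption about how such a record might look, not a description of any production system. The class names, the example evidence, and the choice to compound percentage adjustments are all illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Adjustment:
    reason: str   # what was found about this specific asset
    source: str   # where the evidence came from
    pct: float    # -0.10 means a 10% reduction


@dataclass
class Valuation:
    avm_prior: float
    adjustments: List[Adjustment] = field(default_factory=list)

    def add(self, reason: str, source: str, pct: float) -> None:
        """Record an evidence-backed adjustment to the prior."""
        self.adjustments.append(Adjustment(reason, source, pct))

    def value(self) -> float:
        """Apply adjustments multiplicatively to the AVM prior."""
        result = self.avm_prior
        for adj in self.adjustments:
            result *= 1.0 + adj.pct
        return result

    def audit_trail(self) -> List[str]:
        """One line per adjustment, so a reviewer can accept or override it."""
        lines = [f"AVM prior: {self.avm_prior:,.0f}"]
        for adj in self.adjustments:
            lines.append(f"{adj.pct:+.0%} | {adj.reason} | source: {adj.source}")
        lines.append(f"Adjusted value: {self.value():,.0f}")
        return lines


# Hypothetical example: the figures and evidence are invented.
valuation = Valuation(avm_prior=400_000)
valuation.add("short remaining lease", "lease document", -0.10)
valuation.add("recent refurbishment", "listing photographs", 0.05)

adjusted = valuation.value()
trail = valuation.audit_trail()
```

The point of the structure is the audit trail rather than the arithmetic: a surveyor reviewing the output can see what moved the number, check the source, and strike out any adjustment they disagree with.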
Three implications follow.
For lenders. The compliance environment is converging on the view that AVMs alone are insufficient: the RICS Red Book requires professional judgement above the model, and the new US federal rule requires documented controls around it. Today, that judgement has to come from a human surveyor, which is expensive and slow. An agent layer between the AVM and the surveyor can dramatically reduce the per-decision cost while improving the quality of evidence the surveyor sees. The economics flip from "AVM or surveyor" to "AVM, agent, and surveyor in sequence."
For surveyors. Valuation is not under threat from AI. The mechanical parts of the job are. The surveyor of the next decade spends less time gathering comparables and more time exercising the judgement that the regulatory framework already says is the irreducible core of the work. The professional opinion remains the product. The cost structure of producing it changes.
For fund underwriters and asset managers. The current state is a binary: full appraisal (slow, expensive, comprehensive) or AVM (fast, cheap, blunt). The middle ground — fast, cheap, and asset-specific — has not existed. This is the gap an agentic valuation layer fills, and it materially changes underwriting workflows for portfolio acquisition, refinancing, and ongoing monitoring.
The next decade of property valuation isn't a better model.
Statistical learning, in all its forms — hedonic regression, gradient boosting, neural networks, multi-modal ensembles — has been getting better for fifty years. It will continue to get better. Median errors will keep falling. New data sources will keep arriving. None of that will change the structural mismatch.
AVMs are graded on aggregate statistics. Decisions get made on individual assets.
ML learns the average asset. Valuation is about the specific asset.
The gap between them is what surveyors have always been paid for — and it's the gap LLMs can finally close.
AVMs are not dead. They are not even diminished. They are exactly what they have always been: the most efficient way ever invented to compute a baseline expectation for a property's value. They are a strong prior.
But the valuer is, and has always been, the layer of reasoning that sits above the prior. For twenty-five years, that layer has been a human. For the next twenty-five, it will be a system that combines the statistical baseline with the kind of asset-specific reasoning that only humans could previously do.
That is the shift the industry hasn't priced in yet.
AVMs give you a number. The agent tells you whether to trust it.
Written by the team at UrbanCode, where we are building the agentic valuation layer for global property markets. If you are in lending, surveying, or fund underwriting and this resonates, we would like to hear from you.