Assumed audience: Anyone working in or around wealth management who's curious about how risk numbers actually get validated. No prior quant knowledge needed. If you are a quant, there's enough here for you too.
Risk models make a promise: at the usual 95% confidence level, your portfolio should lose more than the stated value-at-risk (VaR) on roughly one trading day in twenty. That's a testable claim. Back-testing is how you check it. The problem is that the standard way of back-testing isn't demanding enough: it can pass a model that's systematically too slow to react when markets shift.
This article introduces a more rigorous test developed by Edgelab, explains what it found when applied to widely used methodologies, and draws out what it means for anyone who relies on a VaR number to make decisions.
1. What is VaR and why does testing it matter
VaR is a probability forecast. When a risk model says "the 5% VaR on your portfolio is 2% today," it's making a testable claim: on roughly one trading day in twenty, your portfolio should lose more than VaR. No more frequently, no less frequently.
Like any forecast, you can check it after the fact. Count the days the losses exceed VaR. Compare that count to what the model predicted. If the fraction is around 1/20 (5%), the model passes.
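To make the counting concrete, here is a minimal sketch in Python. The function name and the twenty daily losses are made up for illustration, and losses are expressed as positive percentages of portfolio value; none of this comes from a real portfolio or from Edgelab's code.

```python
def exceedance_rate(losses, var_forecasts):
    """Fraction of days on which the realised loss was larger than the VaR forecast."""
    if not losses or len(losses) != len(var_forecasts):
        raise ValueError("need two non-empty series of equal length")
    exceedances = sum(1 for loss, var in zip(losses, var_forecasts) if loss > var)
    return exceedances / len(losses)


# Hypothetical data: 20 daily losses in % (negative values are gains),
# tested against a constant 95% VaR forecast of 2%.
losses = [0.3, -0.5, 1.1, 0.2, 2.4, -0.8, 0.6, 0.1, -1.2, 0.9,
          0.4, -0.3, 1.5, 0.0, -0.6, 0.7, 1.9, -0.1, 0.5, -0.9]
var_95 = [2.0] * len(losses)

rate = exceedance_rate(losses, var_95)
print(rate)  # one exceedance in twenty days
```

Only the 2.4% loss exceeds the 2% VaR, so the observed rate matches the promised one in twenty. In practice the record would cover years of data rather than twenty days.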
This is called back-testing. It's standard practice and a regulatory requirement in most jurisdictions. It sounds rigorous, and as far as it goes it is, but it is not a demanding test.
2. The blind spot in standard back-testing
Imagine you wanted to check whether your weather app is actually reliable. One way: count how often it rains on days it said it wouldn't rain. If it's wrong about 10% of the time, and it claimed 90% accuracy, that's roughly right. Test passed.
But that test misses something important. What if the app is always a day late? The simplest forecast for tomorrow is today's weather, and a forecast like that is always one day behind. It predicts sunshine while the storm is already building, then switches to rain warnings after you're already soaked. The count of sunny and rainy days is correct, but the forecast is useless precisely when you need it most.
This is the problem with relying only on the VaR exceedance count. The standard test also checks just one level, typically the 95% VaR, whereas a trustworthy risk model should get VaR right at every probability level, 90% for example.
A model can pass the standard back-test — getting the long-run count of exceptions roughly right — while still being slow to react. It keeps showing low risk as volatility is already picking up, then catches up after the market has moved. The annual average looks fine. The moments that matter don't.
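A small sketch of how that can happen, using a made-up exceedance record rather than real data: 1 marks a day on which the loss exceeded VaR, 0 a day on which it did not. The helper name and the four-way split are illustrative assumptions.

```python
def exceedance_rate_by_period(hits, n_periods):
    """Split a 0/1 exceedance series into equal consecutive periods and
    return the exceedance rate inside each period."""
    if not hits or n_periods <= 0 or len(hits) % n_periods != 0:
        raise ValueError("len(hits) must be a positive multiple of n_periods")
    size = len(hits) // n_periods
    return [sum(hits[i * size:(i + 1) * size]) / size for i in range(n_periods)]


# Hypothetical record: 400 days with 20 exceedances (5% overall),
# but every one of them falls in the final 100 days.
hits = [0] * 300 + [1, 0, 0, 0, 0] * 20

overall = sum(hits) / len(hits)
by_period = exceedance_rate_by_period(hits, 4)

print(overall)    # 0.05, so the standard count test is satisfied
print(by_period)  # [0.0, 0.0, 0.0, 0.2], four times too many in the last period
```

The full-sample figure is exactly on target, yet the model was too cautious for three periods and badly too optimistic in the fourth.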
One clarification worth making: good risk models don't predict crises. No model reliably can in an era where a single post on social media can throw the market into crisis mode. What they can do is respond quickly once the market starts signaling that conditions are changing. The question is whether a model picks up those signals promptly or lags behind them.
3. What the tile test does differently
Edgelab developed a more demanding test to examine this dimension of model quality.
When a risk model is working correctly, outcomes it predicts shouldn't just balance out on average. They should be spread evenly throughout time. No period should show consistently worse surprises than others. No part of the probability distribution should be systematically underrepresented in any given window.
The tile test checks this by dividing the historical record into windows — tiles — across both time and the probability distribution. Instead of asking "did the model get the full decade roughly right at the 95% level?", it asks "did it perform consistently across every sub-period and for every probability slice?" It's a much stricter question and it surfaces failures the standard test isn't designed to catch.
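The idea of slicing along both dimensions can be sketched as follows. This is a simplified illustration of the description above, not the test statistic from the published paper. It assumes that for each day we already have the probability the model assigned to a return at or below the one actually observed (a number between 0 and 1); the function names, the grid size and the deviation measure are all hypothetical.

```python
def tile_counts(probabilities, n_time_windows, n_prob_bins):
    """Count observations in a grid of (time window, probability bin) tiles.

    probabilities: chronological list of values in [0, 1].
    """
    n = len(probabilities)
    if n == 0 or n % n_time_windows != 0:
        raise ValueError("series length must be a positive multiple of n_time_windows")
    window = n // n_time_windows
    counts = [[0] * n_prob_bins for _ in range(n_time_windows)]
    for t, p in enumerate(probabilities):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        row = t // window
        col = min(int(p * n_prob_bins), n_prob_bins - 1)  # p == 1.0 goes in the last bin
        counts[row][col] += 1
    return counts


def worst_tile_deviation(counts):
    """Largest relative gap between a tile's count and an even spread."""
    total = sum(sum(row) for row in counts)
    expected = total / (len(counts) * len(counts[0]))
    return max(abs(c - expected) / expected for row in counts for c in row)


# Hypothetical model A: outcomes spread evenly over time and probability.
even = [0.1, 0.3, 0.6, 0.9] * 10
# Hypothetical model B: same overall mix, but the tails only show up late.
lagging = [0.3, 0.6] * 10 + [0.1, 0.9] * 10

counts_even = tile_counts(even, 2, 4)
counts_lagging = tile_counts(lagging, 2, 4)
print(counts_even, worst_tile_deviation(counts_even))
print(counts_lagging, worst_tile_deviation(counts_lagging))
```

Both hypothetical models put ten observations in each probability bin over the full record, so a full-sample check cannot tell them apart. Only the tile view shows that model B saw no tail outcomes in the first window and nothing but tail outcomes in the second.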
4. What the results showed
The test was applied to real market data at two risk horizons: one day and ten days. Several standard methodologies were put through it.
Responsiveness matters more than distributional precision. There's a long-running debate about the statistical shape of return distributions: normal, fat-tailed, skewed and so on. The tile test suggests that while this matters, it matters less than how quickly a model responds to shifting conditions. A model with a slightly imperfect distribution but good responsiveness outperforms a theoretically elegant one that reacts slowly. Even so, a good model still has to get the distribution right.
Long historical windows create a specific kind of lag. A family of widely used models builds risk estimates from extended samples of past returns. More history, more information, more stability. That's the logic. The tile test shows the cost: that stability comes at the expense of responsiveness. When conditions shift, these models are slower to reflect the new reality. This is a known criticism of this class of models. The tile test puts a number on it.
Some standard methodologies don't pass. When held to the tile test's standard, several approaches that clear a conventional back-test fall short, not in edge cases, but in ways that are practically relevant for everyday risk management.
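The trade-off between stability and responsiveness can be illustrated with two textbook volatility estimators. This is a sketch under stated assumptions, not a reproduction of the models examined in the research: the return series is invented, the 250-day window and the 0.94 decay factor are illustrative choices, and the function names are hypothetical.

```python
import math


def rolling_volatility(returns, window):
    """Equal-weighted volatility (root mean square return) over the last `window` days."""
    recent = returns[-window:]
    return math.sqrt(sum(r * r for r in recent) / len(recent))


def ewma_volatility(returns, decay):
    """Exponentially weighted volatility: the newest squared return gets
    weight (1 - decay) and older information fades by `decay` each day."""
    variance = returns[0] ** 2  # seed with the first squared return
    for r in returns[1:]:
        variance = decay * variance + (1.0 - decay) * r * r
    return math.sqrt(variance)


# Hypothetical regime shift: 240 calm days of 1% moves, then 10 days of 3% moves.
calm = [0.01, -0.01] * 120
stressed = [0.03, -0.03] * 5
returns = calm + stressed

slow = rolling_volatility(returns, window=250)
fast = ewma_volatility(returns, decay=0.94)

print(round(slow, 4))  # still close to the calm level of 0.01
print(round(fast, 4))  # already well on its way to the new level of 0.03
```

Ten days into the new regime, the long window still reports volatility barely above the calm level, because the new observations are only 10 of the 250 it averages. The exponentially weighted estimate has moved most of the way toward the new level.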
5. What this means for wealth managers
The VaR number on your screen is an output. Behind it is a model. Behind the model are assumptions: about how quickly the estimate should respond to new information, about how much weight to give recent versus older data, about probability distributions. Those assumptions have consequences. And they may never have been tested as rigorously as the tile test allows.
The question worth asking of any risk model — built in-house or provided by a partner — isn't just "does it produce plausible numbers?" It's: when market conditions change, how quickly does this model respond? And how do we know?
At Edgelab, these are questions we can answer. The tile test was developed here, and our methodology passes it. Not as a formality, but as the standard we actually hold ourselves to. A risk estimate that lags reality costs clients money at exactly the wrong time.
Based on the research paper of Gilles Zumbach, "Tile test for back-testing risk evaluation," Quantitative Finance, Vol. 21(10), 1605–1619, 2021.
P.S. This research examines a single asset in isolation. A portfolio is more complex: assets don't move independently, and how they correlate, especially in stress periods, is a separate and equally important dimension of risk. That's the subject of another piece, which looks at back-testing portfolio-level VaR across multiple assets simultaneously.
